analysis  of  experimental  data.  A  minor  amount  of  research  with  classical  models
was  in  the  area  of  test-score  equating.  Classical  item  analysis  procedures, 
however,  received  little  attention.  A  fair  amount  of  research  during  the  pe¬ 
riod  was  devoted  to  different  item  types  and  test  item  response  modes  as  re¬ 
placements  for  the  ubiquitous  multiple-choice  item.  Several  types  of  true- 
false  items  were  proposed,  and  formula  scoring  was  studied  by  a  number  of  re¬ 
searchers  in  an  attempt  to  reduce  guessing  effects.  The  perennial  topic  of 
response  option  weighting  received  attention,  with  efforts  oriented  toward 
demonstrating  effects  on  validity  and  reliability.  Response  modes  studied 
included  answer-until-correct,  confidence  weighting,  and  free-response.

A  number  of  alternatives  to  classical  test  theory  were  studied  in  an  at¬ 
tempt  to  solve  some  of  the  problems  for  which  classical  test  theory  has  proven 
to  be  inadequate.  Research  on  criterion-referenced  testing  continued  during
this  period.  Latent  trait  test  theory  (item  response  theory,  or  IRT)  received 
considerable  attention.  Research  on  the  1-parameter  IRT  model  continued  to 
address  problems  of  parameter  estimation,  model  fit,  and  equating.  The
question  of  the  person-free  and  sample-free  characteristics  of  this  model  (i.e.,
its  robustness)  was  investigated,  with  results  generally  supporting  these
desirable  characteristics.  In  addition,  a  special  case  of  this  model  that  can 
account  for  guessing  was  developed,  and  the  model  was  generalized  and  success¬ 
fully  applied  to  polychotomous  attitude  types  of  items.  Considerable  research 
occurred  on  the  2-  and  3-parameter  IRT  models.  The  concept  of  information  as 
a  replacement  for  classical  reliability  concepts  was  studied,  and  its  uses  in 
developing  parallel  tests  were  described.  As  with  the  1-parameter  IRT  model, 
problems  of  parameter  estimation  and  equating  were  investigated.  These  IRT 
models  were  successfully  applied  to  problems  of  item  option  weighting  and 
adaptive  testing.  Important  developments  with  these  models  during  the  period 
included  the  demonstration  of  their  relationship  with  other  psychological  mea¬ 
surement  models,  and  methods  for  determining  fit  of  individuals  to  IRT  models. 
As  another  alternative  to  classical  test  theory,  order  models  were  developed 
and  studied,  and  several  other  models  were  proposed. 

Validity  issues  were  also  studied  during  this  period.  A  number  of  ap¬ 
proaches  to  the  analysis  of  multitrait-multimethod  matrices  were  proposed  and 
compared,  including  some  based  on  structural  equations  models.  Issues  of  pre¬ 
dictive  validity  studied  included  necessary  sample  sizes,  validity  generaliza¬ 
tion,  and  moderator  and  suppressor  effects.  Test  fairness  issues  and  their 
effects  on  validity  received  considerable  attention.  Concern  was  with  (1) 
bias  in  selection;  (2)  fairness  to  minorities,  including  differential  and  sin¬ 
gle-groups  validity  and  comparisons  of  regression  lines,  adverse  impact,  and 
bias  in  test  content;  and  (3)  fairness  to  women. 

It  is  concluded  that  little  of  consequence  was  accomplished  in  classical 
test  theory  during  this  period.  The  most  important  developments  were  in  al¬ 
ternatives  to  classical  test  theory,  primarily  item  response  theory.  Research 
in  this  area  resulted  in  data  and  other  developments  that  will  permit  a  better 
understanding  of  the  range  of  applicability  of  these  models  and  their  poten¬ 
tial  for  solving  measurement  problems  not  solvable  by  classical  models. 
REVIEW  OF

Test  Theory  and  Methods 


This  review  is  concerned  with  the  period  January  1975  through  December 
1979,  including  a  few  papers  published  in  early  1980.  The  primary  focus  is  the 
published  journal  literature,  although  some  books,  technical  reports,  and  unpub¬ 
lished  literature  have  been  included  where  relevant.  The  focus  of  the  review  is 
on  practical  procedures  for  converting  psychological  observations  into  numerical 
form,  commonly  referred  to  as  "test  theory."  Both  the  theory  and  the  resulting 
methodologies  are  reviewed.  Excluded  are  procedures  commonly  used  for  attitude 
scaling,  both  unidimensional  and  multidimensional.  However,  some  scaling
methods  that  either  have  relationships  with  or  utility  for  testing  in  the  ability
and  achievement  domains  have  been  included,  even  though  they  may  technically  be 
considered  to  be  scaling  methods.  Also  not  included  is  the  considerable
literature  on  data  analytic  procedures  such  as  factor  analysis,  multiple  regression,
most  of  the  literature  on  structural  equations  analysis,  and  statistical  proce¬ 
dures,  which  are  considered  by  some  to  be  part  of  psychological  measurement. 

The  review  also  does  not  include  the  growing  literature  on  problems  of  reliabil¬ 
ity  of  observations  (e.g.  interrater  reliability)  or  such  measurement  approaches 
as  functional  measurement,  which  have  had  little  application  to  the  general 
problem  of  measuring  individual  differences.  Thus,  the  review  is  concerned  with 
procedures  for  the  measurement  of  ability,  aptitude,  and  other  cognitive  vari¬ 
ables  and  for  problems  of  estimating  the  precision  and  utility  (validity)  of 
measurements  of  this  type. 


CLASSICAL  TEST  THEORY  AND  METHODS 


Classical  test  theory  (CTT),  which  has  its  roots  in  work  by  Spearman  in  the 
early  1900s,  now  is  approximating  its  75th  birthday.  Despite  Lumsden's  (1976) 
critique  of  CTT,  research  related  to  it  seems  to  continue  unabated.  Perhaps 
this  attests  to  the  usefulness  of  this  approach  to  instrument  construction,  or 
perhaps  it  attests  to  the  inertia  built  into  a  system  of  education  and  training 
that  produces  researchers  who  continue  to  perpetuate  methodologies  which,  al¬ 
though  useful,  can  be  replaced  by  more  coherent  methodologies. 


Research  on  reliability  estimation  in  CTT  continues  to  focus  on  minor  modi¬ 
fications  of  old  standby  coefficients.  Thus,  Huck  (1978a)  has  modified  Hoyt’s 
analysis  of  variance  reliability  estimation  procedure  (originally  developed  in 
1941)  to  better  estimate  the  "true"  reliability  coefficient.  The  result  is  a 
higher  reliability  estimate  by  better  specifying  the  error  variance,  which  Hoyt 
originally  defined  as  interaction  of  persons  and  items.  Kaiser  and  Michael 
(1977)  show  that  the  "old  faithful"  Kuder-Richardson  Formula  20  (K-R  20)  can  be 
estimated  from  factor  scores  derived  from  "Little  Jiffy"  factor  analysis.  In 
two  closely  related  papers,  Raju  (1977b,  1979)  generalizes  coefficient  alpha  (a 
special  case  of  both  K-R  20  and  Hoyt's  coefficient)  to  the  reliability
coefficient  for  a  "test  battery."  Although  the  development  seems  mathematically
appropriate,  Raju  does  not  attempt  to  describe  what  the  reliability  of  the  test
battery  means;  this  lack  of  coherence  is  characteristic  of  much  of  the  research
on  reliability  coefficients  in  CTT.  Is  a  test  battery  reliable  when  the  tests 
are  all  highly  intercorrelated?  If  so,  of  what  use  is  such  a  test  battery? 

What  kind  of  standard  error  of  measurement  can  be  derived  from  the  reliability 
coefficient  for  a  test  battery?  How  would  such  a  standard  error  of  measurement 
be  used  and  interpreted  in  any  practical  situation?  These  are  some  of  the  kinds 
of  questions  that  need  to  be  considered  with  regard  to  the  development  of  such  a 
generalization  of  coefficient  alpha. 
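
For reference, the coefficient around which this discussion revolves can be computed directly from a persons-by-items score matrix. The following is a minimal sketch and is not taken from any of the papers reviewed; the function name and the toy data are illustrative assumptions. When the items are scored dichotomously, the same computation yields K-R 20.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """Coefficient alpha for a persons-by-items matrix given as a list of rows."""
    k = len(scores[0])  # number of items
    # Variance of each item across persons
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total (sum) score across persons
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 persons by 4 dichotomous items
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = coefficient_alpha(data)
print(round(alpha, 3))
```

Note that the result depends on both the particular items and the particular persons in the matrix, which is the sample dependence criticized later in this section.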

CTT  psychometricians  seem  to  be  playing,  "Will  the  real  lower  bound  please 
stand  up?"  This  compulsion  has  been  pursued  by  ten  Berge  and  Zegers  (1978), 
Jackson  and  Agunwamba  (1977),  Nicewander  (1975),  and  Woodhouse  and  Jackson 
(1977).  These  papers  "improve  upon"  work  done  by  Guttman  in  the  1940s  by  at¬ 
tempting  to  better  estimate  the  true  reliability  from  a  given  data  set.  Nice¬ 
wander  (1975)  shows  a  relationship  between  image  factor  analysis  and  one  of 
Guttman's  lower  bounds.  The  search  is  extended  by  Jackson  (1979)  from  internal
consistency  coefficients  to  split-half  coefficients,  even  though  more  appropri¬ 
ate  methods  exist  for  estimating  internal  consistency  of  a  set  of  items.  Meth¬ 
ods  for  better  estimating  split-half  reliability  were  further  studied  by  Callen¬ 
der  and  Osburn  (1977a)  who  developed  an  algorithm  to  generate  the  maximum  split- 
half  reliability  for  a  set  of  test  items.  They  subsequently  (Callender  &  Osburn 
1977b)  show  that  their  sample-based  maximized  split-half  routine  gives  a  better 
estimate  of  population  reliability  than  do  some  of  the  internal  consistency 
methods,  but  under  the  unrealistic  conditions  of  tau  equivalence,  i.e.  a  linear 
relation  between  true  scores  on  the  two  halves. 

Of  course,  the  Spearman-Brown  (S-B)  formula,  now  well  into  middle  age,  is 
still  a  topic  of  research  in  CTT.  Allison  (1975)  generalizes  the  formula  to 
fractional  length  tests,  and  Feldt  (1975)  provides  a  formula  for  the  situation 
in  which  the  assumption  of  equal  variances  is  not  met.  Similar  to  Callender  and 
Osburn's  (1977a,  1977b)  approach,  Feldt's  coefficient  makes  the  relatively 
strong  assumption  that  true  scores  on  the  two  subtests  are  perfectly  correlated. 
Another  sample-based  optimization  procedure  was  presented  by  Huck  (1978b)  in  his 
solution  to  the  problem  of  estimating  reliability  when  items  are  equally  diffi¬ 
cult.  The  only  really  "new"  reliability  estimation  procedure  to  appear  during 
this  period  is  a  maximum  likelihood  factor  analytic  method  developed  by  Werts, 
Rock,  Linn,  and  Joreskog  (1978). 
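
As a reminder of what is being generalized, the classical Spearman-Brown formula projects the reliability of a test whose length is changed by a factor n, assuming the added or removed parts are parallel to the original. The sketch below implements only this classical formula; it is not Allison's (1975) or Feldt's (1975) derivation, and the function name and numerical values are illustrative assumptions.

```python
def spearman_brown(r, n):
    """Projected reliability when test length is multiplied by n.

    r is the reliability of the original test; n may be fractional
    (n = 2 doubles the test, n = 0.5 halves it). Assumes parallel parts.
    """
    return n * r / (1 + (n - 1) * r)

# A split-half correlation of .60 stepped up to full test length
doubled = spearman_brown(0.60, 2)
# A full-length reliability of .75 projected for a half-length test
halved = spearman_brown(0.75, 0.5)
print(round(doubled, 3), round(halved, 3))
```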

Common  to  all  these  reliability  estimation  procedures,  however,  is  a  major 
weakness  of  CTT — the  sample-based  nature  of  all  of  the  estimation  procedures 
used  for  reliability.  Thus,  reliability  estimates  are  specifically  a  function 
of  the  particular  set  of  items  and  particular  sample  of  individuals  on  which  the 
data  have  been  collected.  The  logical  fallacy,  of  course,  is  that  the  transla¬ 
tion  of  reliability  coefficients  into  errors  of  measurement  results  in  errors  of 
measurement  that  are  specific  to  a  particular  test  administration  event.  Conse¬ 
quently,  the  same  individual  tested  with  two  different  groups  of  individuals  may 
obtain  two  different  errors  of  measurement  and  estimates  of  true  score,  based 
simply  on  the  group  of  testees  with  which  the  person  has  been  tested.  This  is  a 
serious  problem  in  CTT  which  cannot  adequately  be  solved  by  sample-based  methods 
for  determining  test  scores  or  estimates  of  precision  of  measurement. 
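
The sample dependence described above can be illustrated with the usual CTT expressions for the standard error of measurement and for the regressed (Kelley) estimate of true score. This is a sketch under stated assumptions: the two groups, their means, standard deviations, and reliability estimates are hypothetical values chosen only for illustration.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: group standard deviation times sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def estimated_true_score(x, group_mean, reliability):
    """Kelley's regressed estimate of true score for observed score x."""
    return group_mean + reliability * (x - group_mean)

# The same observed score evaluated against two hypothetical groups
x = 60
sem_a = standard_error_of_measurement(sd=10, reliability=0.91)
sem_b = standard_error_of_measurement(sd=8, reliability=0.75)
true_a = estimated_true_score(x, group_mean=50, reliability=0.91)
true_b = estimated_true_score(x, group_mean=55, reliability=0.75)
print(round(sem_a, 2), round(sem_b, 2), round(true_a, 2), round(true_b, 2))
```

With these illustrative values the same observed score of 60 receives a different error band and a different true-score estimate in each group, although nothing about the individual has changed.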


Nevertheless,  CTT  marches  on.  Much  of  the  reliability  research  continues 
to  concentrate  on  coefficient  alpha,  with  a  recent  salutary  trend  toward  methods 
for  testing  the  significance  of  alpha  or  testing  the  difference  in  alpha
coefficients  from  different  groups.  Pandy  and  Hubert  (1975)  compare  several  interval
estimation  procedures  for  coefficient  alpha,  and  Joe  and  Woodward  (1975)  provide 
an  approximate  confidence  interval  for  maximum  coefficient  alpha,  based  on  work 
by  Lord  (1958),  which  showed  maximum  alpha  to  be  a  function  of  the  item  inter¬ 
correlations  for  a  set  of  test  items.  Woodward  and  Bentler  (1978)  provide  a 
statistical  lower  bound  for  population  reliability  that  is  useful  in  estimating 
population  values  of  reliability  from  sample  estimates,  which  are  usually  higher 
due  to  sampling  error.  They  use  the  sampling  distribution  of  estimated  alpha 
coefficients  to  obtain  a  new  coefficient  which  better  estimates  the  population 
reliability.  Two  new  reliability  coefficients  and  one  old  one  are  estimated  by 
Sedere  and  Feldt  (1977)  in  comparison  to  the  theoretical  distribution  of  alpha; 
these  authors  define  conditions  under  which  each  of  the  estimates  of  reliability 
studied  appears  to  be  appropriate. 

One  of  the  most  useful  developments  during  this  period  is  a  test  for  alpha 
coefficients  on  independent  samples  (Hakstian  &  Whalen  1976)  that  is  useful  for 
comparing  the  alpha  coefficients  derived  from  different  groups  of  individuals, 
such  as  individuals  in  different  treatment  conditions.  Their  development  is 
supported  by  simulation  data,  and  is  useful  since  it  allows  conclusions  to  be 
drawn  about  the  effects  of  testing  conditions  on  measurement  precision,  an  area 
of  research  which  has  not  received  much  attention  in  past  years.  Thus,  although 
many  authors  have  hypothesized  the  effects  of  testing  conditions  on  the  preci¬ 
sion,  or  reliability,  of  measurement,  the  existence  of  a  statistical  test  for 
independent  samples  to  compare  such  reliability  coefficients  is  a  useful
development.

Another  major  problem  with  CTT  has  been  in  the  confusion  that  it  has  engen¬ 
dered  among  the  concepts  of  internal  consistency,  homogeneity,  and  unidimension¬ 
ality.  This  is  exemplified  by  the  paper  by  Green,  Lissitz,  and  Mulaik  (1977), 
in  which  homogeneity  and  unidimensionality  are  equated  as  follows:  "Homogeneous
items  have  but  a  single  common  factor  among  them  and  are  related  to  the
underlying  factor  of  ability  or  attitude  in  a  linear  manner"  (p.  830),  whereas  internal
consistency  is  defined  as  "interrelatedness  but  not  necessarily
unidimensionality."  A  related  article  by  Terwilliger  and  Lele  (1977)  attempts  to  clarify
the  relationships  among  internal  consistency,  homogeneity,  and  Guttman's  idea  of 
reproducibility.  When  considered  together,  these  articles  stand  in  sharp  con¬ 
trast  to  each  other  due  to  the  serious  confusion  concerning  these  concepts  that 
has  developed  in  the  reliability  literature.  The  confusion  is  exemplified  by 
Green  et  al.’s  (1977)  equating  of  homogeneity  and  unidimensionality,  whereas 
Terwilliger  and  Lele’s  (1977)  use  of  homogeneity  clarifies  one  use  of  the  term.

To  avoid  perpetuating  confusing  terminology,  homogeneity  should  be  used 
only  in  the  sense  referred  to  by  Loevinger  (1957),  rather  than  in  the  sense  used 
by  Green  et  al.  (1977).  Internal  consistency  is  reflected  by  the  alpha
coefficient  and  its  derivatives;  it  refers  to  the  degree  of  average  item
intercorrelation  among  a  set  of  test  items.  A  set  of  items  is  internally  consistent  when
the  average  intercorrelation  is  high;  it  is  not  internally  consistent  when  the 
average  item  intercorrelation  is  low.  Linear  unidimensionality  is  what  Green  et 
al.  have  called  homogeneity.  Linear  unidimensionality  means  a  single  common 
linear  factor;  and  if  the  factor  is  prominent,  a  unidimensional  set  of  items 
will  also  be  internally  consistent. 


Homogeneity,  on  the  other  hand,  is  not  unidimensionality.  Homogeneity  (in 
Loevinger's,  1957,  sense)  relates  to  the  ratio  of  the  sum  of  the  item  covari¬ 
ances  to  the  maximum  item  covariance.  In  the  extreme,  when  the  sum  of  the  item 
covariances  is  equal  to  the  maximum  possible  item  covariance,  a  linearly  unidi¬ 
mensional  set  of  items  might  result.  However,  homogeneity  (in  Loevinger's 
sense)  does  not  index  linear  unidimensionality  except  at  that  positive  extreme. 
Linear  unidimensionality  is  indexed  by  the  lack  of  variance  of  the  inter-item 
correlations  and  by  a  relatively  high  mean  value  of  item  intercorrelation,  which 
will  result  in  a  single  common  factor.  Item  intercorrelations  can  be  high,  on 
the  average,  yet  have  substantial  variance.  In  that  case,  more  than  one  factor 
may  exist  in  the  item  intercorrelation  matrix.  Consequently,  the  use  of  the 
term  homogeneous  should  not  be  equated  with  the  term  unidimensional.  Rather, 
homogeneous  should  be  used  only  in  the  sense  defined  by  Loevinger,  and
"internally  consistent"  (referring  to  the  degree  of  average  item  intercorrelation)
should  be  used  where  that  meaning  is  intended.  When  linear  unidimensionality  is  explicitly  assumed,
that  term  should  be  used  rather  than  other  terms  which  have  other  meanings. 
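
The distinction between internal consistency and linear unidimensionality can be made concrete with two hypothetical 4-item correlation matrices that share the same mean inter-item correlation. The sketch below is illustrative only: the matrices are invented, and the standardized form of alpha (a function of the mean inter-item correlation alone) is used as the index of internal consistency.

```python
from statistics import mean, pvariance

def off_diagonal(r):
    """Inter-item correlations above the diagonal of a correlation matrix."""
    k = len(r)
    return [r[i][j] for i in range(k) for j in range(i + 1, k)]

def standardized_alpha(r):
    """Standardized alpha, which depends only on the mean inter-item correlation."""
    k = len(r)
    rbar = mean(off_diagonal(r))
    return k * rbar / (1 + (k - 1) * rbar)

# Hypothetical matrix 1: every inter-item correlation equals .5
uniform = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]

# Hypothetical matrix 2: two item clusters (items 0-1 and items 2-3),
# correlating .9 within a cluster and .3 between clusters
clustered = [
    [1.0 if i == j else (0.9 if i // 2 == j // 2 else 0.3) for j in range(4)]
    for i in range(4)
]

for name, r in (("uniform", uniform), ("clustered", clustered)):
    print(name,
          round(standardized_alpha(r), 3),
          round(pvariance(off_diagonal(r)), 3))
```

Both matrices have a mean inter-item correlation of .5 and therefore the same standardized alpha, but only the first has zero variance among its correlations. The second matrix has eigenvalues of 2.5, 1.3, .1, and .1, so a second factor is plausible even though the items are equally "internally consistent" by the alpha criterion.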

Somewhere  during  the  three-quarter  century  history  of  CTT,  the  major  pur¬ 
pose  of  reliability  estimation  seems  to  have  been  lost.  Reliability  coeffi¬ 
cients  in  and  of  themselves  have  little  utility  for  practical  situations,  except 
for  comparing  their  magnitudes  in  order  to  evaluate  measuring  instruments.  How¬ 
ever,  every  reliability  coefficient  should  be  viewed  only  as  a  step  toward  esti¬ 
mating  the  precision  of  an  individual  score.  In  the  history  of  scientific  in¬ 
vestigation,  only  psychometrics  has  developed  the  concept  of  reliability  coeffi¬ 
cients.  In  all  other  applications  of  measurement,  e.g.  physics  and  other  sci¬ 
ences,  precision  of  measurement  is  indexed  by  the  probable  deviation  of  an  ob¬ 
served  value  from  some  true  value.  Alternatively,  precision  is  estimated  by 
some  confidence  interval  that  is  likely  to  include  the  true  value.  Thus,  mea¬ 
surement  of  height  is  accurate  to  plus  or  minus  some  degree  of  error.  Yet  the 
preoccupation  in  psychometrics  seems  to  be  that  of  estimating  reliability  coef¬ 
ficients,  with  little  attention  paid  to  the  problem  of  estimating  the  precision 
of  an  individual  measurement  or,  conversely,  the  error  of  measurement. 

In  the  period  under  review,  only  two  papers  have  been  concerned  with  the 
standard  error  of  measurement,  the  psychometric  analogue  to  physical  errors  of 
measurement.  Dudek  (1979)  revived  some  long-forgotten  history  of  interpreta¬ 
tions  concerning  the  standard  error  of  measurement,  depending  on  whether  the 
user  of  the  measurement  is  concerned  with  estimating  true  score,  or  placing  an 
error  band  around  an  observed  score.  Kleinke  (1979)  demonstrates  bias  in  some 
approximations  to  the  standard  error  of  measurement  based  on  reliability  coeffi¬ 
cients,  and  Whitely  (1979)  is  concerned  with  methods  for  estimating  measurement 
error  on  highly  speeded  tests,  an  issue  that  has  not  been  adequately  resolved 
previously  within  the  context  of  CTT. 

Generalizability  Theory 

Although  not  technically  a  part  of  CTT,  generalizability  theory  is  really
only  a  generalization  of  Hoyt's  basic  idea  of  variance  decomposition  of  a  person 
by  items  response  matrix,  originally  proposed  in  1941.  It  is  also  heavily
rooted  in  "domain  sampling"  theory,  which  was  developed  most  explicitly  by  Tryon
(1957).  Although  originally  proposed  by  Cronbach,  Gleser,  Nanda,  and  Rajaratnam 
in  1972,  because  of  its  complexity  and  the  lack  of  procedures  for  estimating 
many  of  its  parameters,  generalizability  theory  had  not  been  brought  to
practical  status  prior  to  the  period  under  review.  During  this  period,  several
developments  in  generalizability  theory  have  occurred.
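
The variance decomposition on which this approach builds can be sketched for the simplest case, a fully crossed persons-by-items design with one observation per cell. The code below is an illustrative assumption rather than a procedure from any of the papers cited: it computes the persons and residual (persons-by-items interaction) mean squares, the corresponding variance components, and the resulting coefficient, which in this design is Hoyt's coefficient.

```python
def persons_by_items_decomposition(scores):
    """Variance components and Hoyt-type coefficient for a persons-by-items matrix."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items  # persons-by-items interaction

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))

    var_persons = (ms_persons - ms_residual) / k   # person variance component
    var_residual = ms_residual                     # residual variance component
    coefficient = var_persons / (var_persons + var_residual / k)
    return var_persons, var_residual, coefficient

# Hypothetical data: 5 persons by 4 dichotomous items
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
var_p, var_res, coef = persons_by_items_decomposition(data)
print(round(var_p, 3), round(var_res, 3), round(coef, 3))
```

For these data the coefficient equals the value of coefficient alpha computed from the same matrix, which illustrates why the single-facet case adds nothing beyond the classical coefficient; the additional facets are what distinguish generalizability analyses.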

Kaiser  and  Michael  (1975)  derived  Tryon's  (1957)  domain  validity  coeffi¬ 
cient  (which  bears  some  striking  similarities  to  Cronbach  et  al.’s  (1972)  gener¬ 
alizability  coefficient)  using  minimal  assumptions.  It,  of  course,  turned  out 
to  be  a  generalized  version  of  the  alpha  coefficient,  thus  perpetuating  what 
they  characterize  as  "one  of  the  favorite  indoor  sports  of  psychometricians"  (p. 
34),  but  requires  no  assumptions  about  the  means,  variances,  covariances,  or 
structure  of  the  items.  McDonald  (1978)  draws  relationships  between  the  idea  of 
"domain  validity"  and  the  concepts  of  generalizability  theory,  whereas  Cardinet, 
Tourneur,  and  Allal  (1976)  criticize  applications  of  generalizability  theory  to
educational  measurement  and  suggest  examples  of  situations  in  which  the
variables  on  which  differentiation  is  desired  are  opposite  those  that  are
appropriate  for  typical  generalizability  analyses.  Joe  and  Woodward  (1976)  develop
multivariate  generalizability  theory,  estimating  components  of  maximum
generalizability  and  multifacet  experimental  designs  with  multiple  dependent  variables
(which  turn  out  to  be  multivariate  extensions  of  the  Spearman-Brown  formula). 
Brennan  attempts  to  bring  generalizability  theory  to  the  user  (e.g.  Brennan 
1980a),  develops  algorithms  and  procedures  for  the  estimation  of  variance  compo¬ 
nents  (e.g.  Brennan  1975)  and  provides  computer  programs  for  implementing  as¬ 
pects  of  generalizability  theory  (Brennan  1980b). 

Although  generalizability  theory  appears  to  be  a  useful  conceptualization 
that  has  begun  to  reach  practitioners  for  practical  application,  potential  users 
should  carefully  consider  some  of  its  assumptions  before  becoming  too  enamored 
with  it.  Rozeboom  (1978)  criticized  both  Kaiser  and  Michael  (1975)  and  general¬ 
izability  theory  in  terms  of  the  conceptual  existence  of  a  domain.  Rozeboom 
describes  the  logical  impossibility  of  sampling  from  a  domain  in  order  to  make 
the  assumptions  necessary  to  generate  both  coefficient  alpha  and  generalizabili¬ 
ty  theory.  He  also  indicates  that  domain  validity  provides  no  information  about 
the  domain,  since  it  is  strictly  a  function  of  the  number  of  items,  noting  fur¬ 
ther  that  domains  are  likely  to  be  multidimensional,  and  that  only  the  first 
dimension  is  estimated  by  domain  validity  and  the  variance  components  of  gener¬ 
alizability  theory.  Thus,  Rozeboom  questions  the  implicit  and  explicit  assump¬ 
tions  of  generalizability  theory  and  its  predecessors,  with  some  cogent  criti¬ 
cisms  which  should  be  carefully  considered  by  persons  who  use  this  approach  to 
the  estimation  of  measurement  precision. 

Other  Reliability  Issues 


The  measurement  of  change  has  received  a  fair  amount  of  attention  during 
the  period  reviewed.  A  minor  controversy  arose  between  Overall  and  Woodward 
(1975,  1976)  and  Fleiss  (1976)  when  Overall  and  Woodward  demonstrated  by  some 
derivations  that  the  power  of  t-tests  is  at  the  maximum  when  the  reliability  of
difference  scores  is  0.  Fleiss  showed  that  Overall  and  Woodward  assumed  a  re¬ 
strictive  model  with  no  interaction  between  subjects  and  time,  but  when  a  less 
restrictive  (and  more  realistic)  model  was  assumed,  then  the  maximum  power  of 
the  t-test  for  correlated  measures  is  attained  when  the  reliability  of  differ¬ 
ence  scores  is  a  maximum.  Overall  and  Woodward  (1976)  replied  that  Fleiss  was 
concerned  with  the  reliability  in  the  original  pre-test  and  post-test  scores, 
and  that  his  findings  are  correct  for  that  situation,  but  that  he  did  not
consider  the  reliability  of  difference  scores.  Overall  and  Woodward  then
reasserted  that  the  power  of  a  pre-post-test  t-test  is  highest  when  difference
scores  are  unreliable.  Williams  and  Zimmerman  (1977)  discuss  the  reliability  of
difference  scores  when  errors  are  correlated  and  conclude  that  when  errors  are
correlated  (which  may  well  be  the  case  in  a  number  of  applications),  difference
scores  can  be  more  reliable  than  when  they  are  not  correlated,  which  is  the  usual
assumption  made.
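
For orientation, the classical expression for the reliability of a difference score D = Y - X under the usual assumption of uncorrelated errors can be sketched as follows. This is the textbook formula only; it is not the correlated-errors result of Williams and Zimmerman (1977), and the function name and numerical values are illustrative assumptions.

```python
def difference_score_reliability(sd_x, sd_y, r_xx, r_yy, r_xy):
    """Reliability of D = Y - X, assuming uncorrelated measurement errors.

    sd_x, sd_y: observed standard deviations of pre-test and post-test
    r_xx, r_yy: reliabilities of pre-test and post-test
    r_xy: observed correlation between pre-test and post-test
    """
    covariance = r_xy * sd_x * sd_y
    true_variance = sd_x ** 2 * r_xx + sd_y ** 2 * r_yy - 2 * covariance
    observed_variance = sd_x ** 2 + sd_y ** 2 - 2 * covariance
    return true_variance / observed_variance

# Hypothetical case: equal variances, reliabilities of .80, pre-post correlation of .60
r_dd = difference_score_reliability(1.0, 1.0, 0.80, 0.80, 0.60)
print(round(r_dd, 3))
```

In this hypothetical case two tests with reliabilities of .80 yield a difference score with a reliability of only .50, which is the kind of result that gives the question of power versus difference-score reliability its practical interest.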

Other  problems  in  the  measurement  of  change  are  addressed  by  Bond  (1979), 
Cascio  and  Kurtines  (1977),  Corder-Bolz  (1978),  Howard,  Ralph,  Gulanick,  Max¬ 
well,  Nance,  and  Gerber  (1979),  Hoogstraten  (1979),  Linn  and  Slinde  (1977),  and 
Richards  (1975).  Werts,  Linn,  and  Joreskog  (1977)  present  a  maximum  likelihood 
factor  model  for  estimating  reliabilities  and  unattenuated  correlations  between
growth  measures,  whereas  Werts  and  Hilton  (1977)  and  Hilton  (1976) — also  using 
structural  equations  models — describe  direct  estimates  of  change  score  reliabil¬ 
ities  and  unattenuated  correlations  between  pre-test  and  change  scores. 

Applications  of  reliability  theory  in  experimental  design  and  the  analysis 
of  experimental  data  have  begun  to  receive  some  attention.  Nicewander  and  Price 
(1978)  extend  the  Overall  and  Woodward  (1975,  1976)  and  Fleiss  (1976)  controver¬ 
sy  to  a  discussion  of  the  reliability  of  dependent  variables  and  the  power  of 
significance  tests.  They  indicate  that  reliability  is  not  related  to  power  for 
controlled  experiments  and  that  under  certain  conditions  both  of  the  previous 
authors  are  correct.  Their  discussion  centers  on  the  problem  of  an  individual 
differences  versus  an  experimental  focus  in  the  research  design,  since  consider¬ 
ations  of  both  between  subjects  sampling  variance  and  measurement  error  variance 
are  relevant  to  the  reliability  and  power  issue.  Subkoviak  and  Levin  (1977; 
Levin  &  Subkoviak  1977,  1978)  and  Forsyth  (1978a,  1978b)  discuss  the  effects  of 
measurement  errors  on  the  power  of  statistical  tests.  Careful  reading  of  this 
interchange  indicates  that  the  nature  of  the  experimental  design  plays  some  role 
in  the  effect  of  reliability  on  power,  as  does  whether  observed  scores  or  true 
scores  are  being  considered. 

Other  Applications 

A  few  other  minor  issues  in  CTT  were  also  studied  during  this  period. 

Slinde  and  Linn  (1977b)  compared  linear  versus  equipercentile  methods  for  equat¬ 
ing  different  tests  given  to  different  groups  of  individuals.  They  found  that 
the  equipercentile  method  was  better  than  the  linear  method  but  that  both  had 
some  serious  problems  in  properly  equating  test  scores.  Rubin  and  Thayer  (1978) 
also  considered  the  problem  of  test  equating  in  the  situation  where  a  reference 
test  is  given  to  each  of  a  number  of  groups  and  new  tests  are  given  to  only  one 
of  the  groups.  The  problem  they  considered  was  to  estimate  scores  on  the  new 
tests  even  though  everyone  did  not  take  them.  Their  method  is  limited,  however, 
to  the  use  of  "plausible"  values  estimated  for  the  intercorrelations  among  the 
new  tests  and  the  standard  reference  test.  Healy  (1979)  formulated  a  test  of 
the  linear  relation  between  two  true  scores,  but  the  test  is  limited  to  the  sit¬ 
uation  in  which  the  covariance  matrices  of  the  two  tests  are  equal.  Lord  and 
Stocking  (1976)  developed  a  method  for  estimating  the  regression  function  of 
true  score  and  observed  score  assuming  a  binomial  error  model  but  not  assuming 
that  true  score  and  error  scores  are  linearly  related.  The  attenuation  paradox,
and  its  relationship  to  the  distribution  of  ability,  is  considered  by  Nice¬ 
wander,  Price,  Mendoza,  and  Henderson  (1977);  they  indicate  that  the  attenuation 
paradox  will  result,  regardless  of  the  distribution  of  ability,  if  items  of 


"perfect  discrimination"  are  used.  Finally,  Zimmerman  (1976)  develops  CTT  from 
concepts  of  probability  theory  and  statistical  sampling  theory,  rather  than  from 
the  usual  assumptions.  The  problem,  of  course,  is  that  neither  probability
theory  nor  statistical  sampling  theory  has  any  relationship  to  the  psychological
processes  underlying  test  behavior,  thus  further  removing  CTT  from  the  main¬ 
stream  of  psychology. 

Item  Analysis 

Virtually  no  progress  was  made  during  the  period  in  the  area  of  item  analy¬ 
sis.  The  papers  that  have  appeared  seem  to  be  either  repetitions  of  what  has 
been  done  for  years  in  item  analysis  or  minor  extensions  of  techniques  already 
available.  For  example,  D'Agostino  and  Cureton  (1975)  concluded  that  the  old 
"27%  rule" — contrasting  the  proportions  correct  for  the  upper  and  lower  27%  of 
the  score  distribution — is  acceptable  but  that  a  21%  rule  would  be  better.  Berk 
(1978)  empirically  evaluated  formulas  for  corrections  of  item-total  point-bise¬ 
rial  correlations,  and  Beuchert  and  Mendoza  (1979)  find  very  few  differences 
among  10  item  discrimination  indices,  as  did  Oosterhof  (1977)  in  his  factor 
analysis  of  19  item  discrimination  indices.  Both  Aiken  (1979)  and  Hoffman 
(1975)  concerned  themselves  with  the  age-old  issue  of  choosing  items  based  on
both  difficulty  and  discrimination  indices. 
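
The upper-lower group contrast behind the "27% rule" can be sketched in a few lines. The code below is an illustration under stated assumptions, not the procedure of any paper cited: the function name, the rounding rule for group size, and the data are hypothetical, and the total score is used without removing the item being analyzed.

```python
def upper_lower_index(total_scores, item_scores, fraction=0.27):
    """Difference in proportion correct between upper and lower score groups."""
    n = len(total_scores)
    group_size = max(1, round(fraction * n))
    # Indices of examinees ordered from lowest to highest total score
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data: total scores and 0/1 responses to one item for 10 examinees
totals = [12, 25, 18, 30, 9, 22, 27, 15, 20, 7]
item = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]

d_27 = upper_lower_index(totals, item, fraction=0.27)
d_21 = upper_lower_index(totals, item, fraction=0.21)
print(round(d_27, 3), round(d_21, 3))
```

The two fractions give different index values for the same item here only because the groups contain different examinees, which is consistent with the observation that such variations amount to minor extensions of an old technique.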

In  the  area  of  personality  measurement,  Neill  and  Jackson  (1976)  developed 
an  item  efficiency  index  designed  to  reduce  scale  intercorrelations  for  a  multi¬ 
scale  measure,  since  multiscale  measures  are  likely  to  be  more  valid  against 
external  criteria  if  interscale  correlations  are  low.  Their  method  is  illus¬ 
trated  with  personality  data  but  has  applications  to  other  multiscale  batteries. 
Yet  another  demonstration  that  observed  score  distributions,  and  therefore  pro¬ 
portion  of  testees  passing  any  given  cutoff  score,  can  be  manipulated  by  the  way 
items  are  selected  on  item  difficulty  is  given  by  Dyck  and  Plancke-Schuyten
(1976);  Nevo  (1977)  demonstrates  what  should  be  obvious — that  traditional  item 
analysis  does  not  increase  test-retest  reliability;  and  in  a  somewhat  useful 
development,  Hsu  (1978)  gives  appropriate  alpha  levels  to  use  in  testing  item 
analysis  statistics  when  the  use  of  multiple  items  changes  the  experimentwise
error  rate. 


Item  Response  Modes 

A  little  psychology  begins  to  interact  with  test  theory  when  real  people 
begin  to  take  real  test  items.  Since  the  beginning  of  CTT,  the  "objective"  test 
item  has  been  the  rule,  usually  characterized  as  a  dichotomous  (true-false)  item 
or  as  the  ubiquitous  multiple-choice  item.  Ever  since  the  invention  of  these 
test  item  formats,  a  number  of  questions  have  arisen,  and  research  still  contin¬ 
ues  in  an  attempt  to  answer  them.  The  questions  arise  from  the  fact  that  the 
objective  test  format  leads  to  some  loss  in  information  on  a  testee's  ability/ 
achievement  level  and  may  introduce  other  sources  of  variability  in  test  item 
responses  (such  as  guessing)  in  addition  to  the  variable  that  the  test  item  is 
designed  to  measure.  Research  on  these  issues  has  consistently  manifested  it¬ 
self  in  several  areas:  (1)  attempts  to  study  the  effects  of  guessing  on  various 
item  response  formats;  (2)  attempts  to  reduce  the  effects  of  guessing  by  the  use 
of  various  scoring  formulas;  (3)  the  effect  of  various  item  option  weighting 
schemes;  (4)  the  use  of  alternative  response  formats  within  multiple-choice 
items;  and  (5)  studies  on  the  use  of  alternative  response  formats. 


True-false  tests.  Psychometricians  frequently  decry  the  use  of  true-false 
(T-F)  tests  probably  because  of  the  high  probability  of  guessing  on  these  kinds 
of  items.  Consequently,  a  number  of  authors  have  proposed  alternatives  to  T-F 
tests  such  as  paired  T-F  tests  (Eakin  &  Long  1977)  in  which  two  T-F  items  are 
presented  together  and  the  testee  answers  as  if  the  item  were  a  four-choice  mul¬ 
tiple-choice  item,  indicating  the  truth  or  falseness  of  each  of  the  four  item 
pairs.  These  authors  suggest  that  this  approach  better  reflects  the  true  knowledge,
in  comparison  to  a  number-correct  score  (but  not  in  comparison  to  a  correct
minus  incorrect  score).  Hsu  (1979a)  studied  a  similar  type  of  T-F  test  and
found  an  interaction  with  knowledge  level  of  examinees,  in  which  the  grouped  T-F 
items  (with  two  or  three  items  per  cluster)  were  better  for  low  ability  testees
than  the  separate  T-F  items,  which  were  better  for  moderate  and  high  ability 
testees.  Hsu  (1979b)  followed  up  on  Eakin  and  Long's  results  and  reported  that 
their  method  of  scoring  the  paired-item  T-F  test  results  in  misranking  testees 
under  certain  conditions.  Finally,  Aiken  and  Williams  (1978)  studied  the  ef¬ 
fects  of  instructions  to  testees  on  seven  methods  of  scoring  T-F  items.  Their 
results  supported  those  of  Hsu  (1979a)  showing  an  interaction  of  scoring  methods 
with  ability  levels. 

Formula  scoring  in  multiple-choice  tests.  Given  the  decision  to  use  a  multiple-choice
(M-C)  test,  the  psychometrician  must  next  determine  how  that  test
is  to  be  scored.  Obviously,  it  is  simple  to  count  the  number  of  items  answered 
correctly,  but  the  problem  of  guessing,  when  a  testee  does  not  know  the  answer, 
rears  its  ugly  head.  The  history  of  CTT  has  concerned  itself  for  some  time  with 
the  guessing  problem,  and  one  classical  solution  is  that  of  formula  scoring.  A 
formula  score  is  a  score  other  than  a  number-correct  score  that  is  designed  to 
adjust  the  number-correct  score  for  chance  successes  due  to  guessing.  A  typical 
formula  score  is  the  number  correct  minus  some  fraction  of  the  number  answered 
incorrectly.  Lord  (1975a,  1975b)  began  a  minor  controversy  by  studying  the  rel¬ 
ative  efficiency  of  number-correct  and  formula  scores,  using  concepts  of  latent 
trait  test  theory  to  compare  the  two  scoring  methods.  Although  this  approach 
may  be  unfamiliar  to  many,  since  scoring  methods  are  usually  compared  in  terms 
of  reliability,  the  comparison  made  by  Lord  can  be  interpreted  in  a  quasirelia¬ 
bility  sense  through  the  relationships  between  the  concept  of  information  and 
error  of  measurement  (see  below).  Lord's  main  result  was  that  formula  scoring 
is  3%  to  9%  more  efficient  than  number-correct  scoring,  primarily  for  testees  of 
moderate  to  low  ability  levels. 
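
A formula score of the kind described above can be sketched as follows. The fraction 1/(k - 1), where k is the number of options per item, is the conventional correction and is an assumption here, since the text speaks only of "some fraction"; the function name and the numbers are illustrative.

```python
def formula_score(num_right, num_wrong, num_options):
    """Number right minus wrong / (k - 1); omitted items carry no penalty."""
    if num_options < 2:
        raise ValueError("an item needs at least two options")
    return num_right - num_wrong / (num_options - 1)


# Hypothetical 40-item, five-option test: 28 right, 8 wrong, 4 omitted.
score = formula_score(28, 8, 5)

# Blind guessing on 100 five-option items yields, on average, 20 right and 80 wrong.
blind_guess_score = formula_score(20, 80, 5)
```

Under this choice of fraction the expected gain from purely random guessing is zero, which is the sense in which the score "adjusts for chance successes."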


A  number  of  researchers  have  taken  issue  with  Lord's  finding,  based  primar¬ 
ily  on  the  particular  set  of  assumptions  that  Lord  implicitly  made  in  his  study. 
Specifically,  Lord  assumed  that  formula  scoring  instructions  concerning  omitting 
responses  would  result  in  random  guesses  if  the  same  test  was  administered  under 
number-correct  instructions.  Cross  and  Frary  (1977)  administered  a  20-item 
test,  including  four  items  with  no  correct  answer,  using  formula  scoring  instructions
(as  specified  by  Lord  1975a)  and  then  had  testees  answer  the  omitted
items  in  a  different  color.  For  Cross  and  Frary's  (1977)  more  than  407  testees,
results  showed  that  there  were  many  omitted  items  that  the  testees  had  a  better 
than  chance  probability  of  answering  correctly.  The  results  also  indicated  that 
27%  of  the  testees  did  not  recall  the  exact  directions  under  which  the  test  was 
administered.  Their  results  did  not  support  Lord's  assumptions,  however,  even
for  those  who  did  understand  and  comply  with  the  directions.  Cross  and  Frary 
concluded,  therefore,  that  the  correction  for  guessing  should  not  be  used.  Similarly,
Wood  (1976)  administered  M-C  tests  under  three  different  kinds  of  instructions
and  examined  whether  the  instructions  produced  the  desired  effects
for  different  groups.  Wood's  conclusion  is  similar  to  Cross  and  Frary's  in  that 
he  found  the  instructions  worked  only  to  some  extent  in  eliminating  blind  guess¬ 
ing. 


Rowley  and  Traub  (1977)  also  studied  the  behavior  of  testees
under  Lord's  (1975a)  formula  scoring  instructions  in  an  ability  testing  situation.
Rowley  and  Traub's  (1977)  conclusion  is  that  formula  scoring  is  not  logically
supported  because  of  differential  personality  factors  on  the  part  of  the
testees,  which  enter  into  the  responses  to  M-C  test  items.  This  result  is  supported
by  Slakter,  Crehan,  and  Koehler  (1975),  who  examined  risk-taking  on  objective
examinations  and  found  that  guessing  occurred  even  when  the  penalty  for
guessing  was  known.  They  also  found  age  and  sex  differences  in  means  and  sta¬ 
bilities  of  a  risk-taking  score.  In  an  attempt  to  develop  better  formula 
scores,  Reid  (1977)  and  Abu-Sayf  (1977)  developed  new  versions  of  formula 
scores,  and  Molenaar  (1977)  took  a  Bayesian  approach  to  correcting  for  random 
guessing.  The  weight  of  the  data  accumulated  during  this  period  is  that  formula 
scoring  of  M-C  tests  has  some  problems  but  that  number-correct  scores  are  also
nonoptimal  when  guessing  is  likely  to  occur.  Finally,  Grier  (1975)  suggested 
that  three-alternative  M-C  items  (scored  by  number  correct)  maximize  expected 
test  reliability,  with  two-alternative  items  better  than  four-,  five-,  or  six- 
alternative  items.  It  should  be  noted,  however,  that  this  conclusion  is  valid 
only  if  test  length  is  increased  to  compensate  so  that  the  number  of  alterna¬ 
tives  times  the  number  of  items  is  fixed.  His  results  are  also  restricted  to 
the  specified  assumptions  regarding  the  quality  of  the  test  items  and  should  not 
be  accepted  as  an  across-the-board  recommendation. 

Option  weighting.  Since  formula  scoring  does  little  to  increase  the  amount 
of  information  obtainable  due  to  partial  knowledge  from  M-C  items,  a  consider¬ 
able  amount  of  research  over  the  years  has  been  spent  in  comparing  methods  for 
weighting  the  options  of  M-C  items  and  including  those  option  weights  in  total 
scores.  The  last  five-year  period  was  no  exception  to  this  trend.  Building  on 
work  by  Mosier  in  the  1930s  and  Guttman  in  the  1940s,  methods  of  option  weight¬ 
ing  have  continued  to  receive  considerable  amounts  of  attention.  Claudy  (1978) 
proposed  the  biserial  correlation  of  each  option  with  total  score  as  an  item 
option  weighting  scheme.  He  compared  his  method  with  a  number  of  other  option¬ 
weighting  schemes  and  evaluated  the  results  with  split-half  reliability  coeffi¬ 
cients.  Serlin  and  Kaiser  (1978)  increased  the  reliability  of  M-C  tests  by 
weighting  item  responses  based  on  their  loadings  on  the  first  principal  compo¬ 
nent  of  the  intercorrelations  among  the  0-1  weights  for  each  item  alternative. 
They  showed  an  increase  in  alpha  from  .44  for  0-1  scoring  to  .77  but  did  not  cross-validate
their  results.  Downey  (1979),  Raffeld  (1975),  Reilly  (1975),  and  Echternacht
(1976)  compared  various  item  option  weighting  schemes  in  terms  of  reliability
and  validity  against  a  number  of  criteria.  Their  general  conclusion  is
that  there  are  almost  always  increases  in  reliability  for  most  item  option 
weighting  schemes  and  occasional  increases  in  validity,  depending  on  the  nature 
of  the  validity  criterion  used.  Bejar  and  Weiss  (1977)  demonstrated  that  the 
increases  in  the  reliability  of  option-weighting  schemes  are  generally  a  func¬ 
tion  of  the  item  intercorrelations,  whereas  Echternacht  (1975)  derived  formulas 
for  estimating  the  variances  of  one  type  of  option  weights  and  provided  sugges¬ 
tions  for  item  writing  to  decrease  these  variances.  Cross  and  Frary  (1978) 
studied  the  effects  of  "guess"  and  "do  not  guess"  instructions  on  empirical 
choice  weighting  and  compared  them  to  number-correct  and  formula  scores.  Their
results  showed  no  differences  in  validities  between  the  guessing  conditions  but
higher  validities  for  the  empirical  choice  weighting  procedure. 
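
As a rough illustration of empirical option weighting, the sketch below weights each option of an item by the mean criterion score of the examinees who chose it, then scores a response pattern as the sum of the chosen options' weights. This particular weighting rule is an assumption chosen for simplicity; it is not the biserial or principal-component scheme of any author cited above, and all names and data are hypothetical.

```python
from collections import defaultdict


def option_weights(choices, criterion):
    """Mean criterion score of the examinees choosing each option of one item."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for option, score in zip(choices, criterion):
        sums[option] += score
        counts[option] += 1
    return {option: sums[option] / counts[option] for option in sums}


def weighted_score(responses, weights_per_item):
    """Sum of the weights of the options one examinee chose, item by item."""
    return sum(weights[resp] for resp, weights in zip(responses, weights_per_item))


# Hypothetical data: six examinees' choices on one item and their criterion scores.
choices = ["A", "B", "A", "C", "B", "A"]
criterion = [9, 4, 8, 2, 6, 10]
weights = option_weights(choices, criterion)

# Score an examinee who chose A on one item and C on another, reusing the
# same weights for both items purely to keep the example short.
total = weighted_score(["A", "C"], [weights, weights])
```

Because the weights are estimated from the same sample they are applied to, any gain in reliability or validity would need to be checked in a cross-validation sample.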

Different  response  modes.  Since  formula  scoring  offers  little  improvement 
over  number-correct  scoring,  and  the  results  regarding  differential  option 
weighting  have  been  mixed,  the  search  continues  for  ways  to  improve  the  characteristics
of  scores  obtained  from  M-C  test  items.  Duncan  and  Milton  (1978)  propose
simply  modifying  the  answering  procedure  in  an  M-C  item  by  constructing
items  of  any  number  of  correct  responses  and  by  permitting  the  testee  to  choose 
any  number  of  answers  as  correct.  They  call  these  multiple-answer-multiple- 
choice  items,  develop  relevant  scoring  rules,  and  evaluate  some  Bayesian  and 
minimax  strategies  for  responding  to  these  kinds  of  items.  Their  proposal  is 
interesting  and  might  stimulate  some  relevant  research  that  might  result  in  im¬ 
proved  scores.  They  also  examine  the  possibility  of  probabilistic  responding  in 
which  the  testee  responds  with  subjective  probabilities  for  each  alternative, 
one  at  a  time. 

Another  procedure  for  a  different  kind  of  responding  to  M-C  items  is  the 
answer-until-correct  (AUC)  procedure.  These  procedures  have  been  studied  since 
the  early  1950s,  based  on  work  by  Coombs  (1953).  Recent  research  on  this  prob¬ 
lem  illustrates  theoretical  increases  in  efficiency  for  this  procedure  (Gibbons, 
Olkin,  &  Sobel  1979),  provides  equations  making  it  possible  to  select  items  that 
minimize  guessing  under  AUC  administration  (Kane  &  Moloney  1978),  and  demon¬ 
strates  higher  levels  of  reliability  but  lower  levels  of  validity  (Hanna  1975). 
Although  AUC  procedures  may  reduce  the  number  of  items  required  in  a  test,  their 
use  may  also  increase  testing  time. 

Confidence  weighting  also  continues  to  be  studied.  In  this  procedure,  tes- 
tees  answer  an  item  either  (1)  by  choosing  a  correct  answer  and  assigning  some 
confidence  level  to  their  choice  or  (2)  by  distributing  some  fixed  number  of 
points  (e.g.  100),  indicating  their  confidence  that  each  of  the  M-C  alternatives 
is  correct.  Another  variation  is  simply  ordering  the  alternatives  in  the  item 
on  the  basis  of  correctness.  Diamond  (1975),  Poizner,  Nicewander,  and  Gettys 
(1978),  Wen  (1975),  and  Pugh  and  Brunza  (1975)  all  found  increases  in  reliability
of  confidence-weighted  scores  over  nonconfidence-weighted  scores,  and  Abu-Sayf
and  Diamond  (1976)  found  increases  both  in  reliability  and  validity.

In  a  relatively  comprehensive  comparison  of  several  methods  of  assessing 
partial  knowledge  in  M-C  items,  Hakstian  and  Kansup  (1975)  and  Kansup  and  Hak- 
stian  (1975)  compared  both  conventional  and  confidence  rating  instructional  sets 
and  a  number  of  different  scoring  methods — including  confidence  rating  scoring 
on  both  verbal  ability  and  mathematical  reading  tests — in  terms  of  both  reli¬ 
ability  and  validity.  They  concluded  that  confidence-rated  scores  were  more 
internally  consistent  and  more  stable  than  conventional  scores.  They  also  found 
higher  validities  for  confidence-rated  mathematics  scores  in  comparison  to  the 
conventional  scores,  but  the  differences  were  not  statistically  significant.
For  the  verbal  ability  tests,  the  conventional  scores  were  more  valid  than  any
of  the  option-rating  scores.  They  conclude  that  although  the  validity  results 
were  mixed,  including  some  higher  levels  of  validity  for  confidence-rated  scores 
on  their  grade-point  average  criterion,  confidence  rating  testing  time  was  lon¬ 
ger,  and  more  conventional  test  items  could  compensate.  However,  these  are  sim¬ 
ply  extrapolations  from  their  data  that  have  not  been  supported  by  empirical 
results.  Thus,  the  data  tended  to  show  some  specific  utility  for  the  confidence-rating
approaches,  perhaps  greater  in  more  complex  abilities  such  as
mathematics  than  in  abilities  that  rely  less  on  information-processing  charac¬ 
teristics  of  the  individual,  such  as  verbal  ability. 

Linn  (1976b)  critically  evaluated  probability  responding  procedures,  especially
the  "value  assigned"  score,  where  a  person's  score  is  the  probability
value  assigned  to  correct  responses.  Linn  indicated  that  the  value-assigned 
score  does  not  provide  a  reproducing  scoring  system,  since  the  best  strategy  is 
an  all  or  none  response;  and  since  testees  do  not  respond  rationally  to  differ¬ 
ing  probability  situations  (where  rationally  is  defined  as  maximizing  the  total 
scores),  probabilistic  testing  procedures  must  be  designed  in  terms  of  the  in¬ 
structions  to  the  testee  in  order  to  eliminate  these  problems.  Linn's  sugges¬ 
tion  is  that  if  probabilistic  procedures  are  combined  with  appropriate  instruc¬ 
tions  to  the  testee,  gains  in  utility  of  the  test  scores  might  result. 
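
The point about the value-assigned score can be made concrete with a small sketch. A scoring system is "reproducing" when a testee maximizes expected score by reporting his or her actual subjective probabilities. The quadratic rule used for contrast below is a standard example of such a rule and is brought in here only for illustration; it is not attributed to Linn, and the belief vector is hypothetical.

```python
def expected_value_assigned(belief, report):
    """Expected score when the score is the probability reported for the correct option."""
    return sum(b * r for b, r in zip(belief, report))


def expected_quadratic(belief, report):
    """Expected score under the rule 2 * report[correct] - sum of squared reports."""
    penalty = sum(r * r for r in report)
    return 2 * sum(b * r for b, r in zip(belief, report)) - penalty


# Hypothetical testee who believes the three options are correct with these probabilities.
belief = [0.6, 0.3, 0.1]
honest = belief
all_or_none = [1.0, 0.0, 0.0]

va_honest = expected_value_assigned(belief, honest)            # about 0.46
va_all_or_none = expected_value_assigned(belief, all_or_none)  # 0.6
q_honest = expected_quadratic(belief, honest)                  # about 0.46
q_all_or_none = expected_quadratic(belief, all_or_none)        # about 0.2
```

Under the value-assigned score the all-or-none report has the higher expectation, so honest reporting is not rewarded; under the quadratic rule the ordering reverses.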

One  obvious  solution  to  the  problems  inherent  in  M-C  tests  is  simply  to 
replace  the  M-C  item  with  some  other  kind  of  test  item.  Obviously,  the  M-C  item 
has  retained  its  popularity  because  of  the  ease  in  objective  scoring  of  such 
items  and  the  rapidity  with  which  scores  can  be  derived.  As  recent  research  on 
computer-administered  testing  (e.g.  Brown  &  Weiss  1977;  Cory  1976;  Cory,  Rimland 
&  Bryson  1977;  Kingsbury  &  Weiss  1979a;  Weiss  1976)  results  in  the  ultimate  re¬ 
placement  of  the  paper  and  pencil  test  by  computer-administered  tests,  the  pos¬ 
sibility  of  replacing  the  M-C  item  with  free-response  items  becomes  realistic. 
Very  little  research,  however,  has  been  done  on  the  use  of  free-response  items.
Vale  and  Weiss  (1977)  compared  free-response  and  M-C  vocabulary  tests  in  terms 
of  information  concepts  from  latent  trait  test  theory.  Their  data  show  substan¬ 
tial  increases  in  precision  of  measurement  to  be  gained  from  the  use  of  free-re¬ 
sponse  items  in  comparison  to  M-C  items,  primarily  for  testees  of  middle  and 
high  ability  levels. 

Traub  and  Fisher  (1977)  compared  the  factor  structure  of  free-response  and 
M-C  items,  as  well  as  AUC  scoring,  using  verbal  comprehension  and  mathematical 
reasoning  items.  Their  tests  were  carefully  designed  to  be  equivalent,  and  the 
factor  structures  were  compared  by  methods  of  confirmatory  factor  analysis.  The 
results  showed  that  the  different  item  formats  measured  the  same  factors  in  the 
mathematics  test  but  that  the  structure  of  the  verbal  test  was  a  function  of  the 
format,  with  the  free-response  format  resulting  in  a  more  complex  factorial 
structure  than  the  other  two  item  types.  These  results  suggest  that  additional 
information  may  exist  in  free-response  data,  which  may  permit  different  kinds  of 
measurement  to  be  obtained  from  different  response  formats,  and  that  care  should 
be  taken  in  the  translation  of  existing  tests  into  new  response  formats,  since 
different  dimensions  may  be  measured  by  the  same  items  cast  in  different  re¬ 
sponse  formats.  Harris  and  Pearlman  (1978)  provide  a  domain-oriented  index  of 
response  agreement  for  free-response  items. 


ALTERNATIVES  TO  CLASSICAL  TEST  THEORY 

Because  classical  test  theory  has  been  unable  to  adequately  solve  a  number 
of  testing  problems  during  its  history,  several  alternative  test  models  have 
been  proposed.  Criterion-referenced  testing  has  developed  and  flourishes  in  an 
attempt  to  solve  the  mastery  testing  problem.  Latent-trait  based  test  theories 
continue  to  be  refined  and  applied  to  a  wide  range  of  problems  for  which  classical
test  theory  is  inadequate.  Order-based  test  models,  which  have  developed
during  the  last  few  years,  show  some  promise  for  measuring  certain  kinds  of  psy¬ 
chological  variables,  and  a  few  other  miscellaneous  new  approaches  have  been 
proposed. 


Criterion-Referenced  Testing 

The  extensive  literature  on  criterion-referenced  testing  can  be  divided 
into  two  parts:  (1)  articles  dealing  with  conceptual  issues  and  (2)  articles 
dealing  with  technical  considerations. 

Conceptual  Issues 

Popham  (1975,  p.  130)  and  Hambleton,  Swaminathan,  Algina,  and  Coulson 
(1978,  p.  2)  define  a  criterion-referenced  test  (CRT)  as  one  "used  to  ascertain 
an  individual's  status  [the  individual's  domain  score]  with  respect  to  a  well- 
defined  behavior  domain."  If  this  definition  captured  the  sole  essence  of  CRT, 
the  term  CRT  would  be  less  appropriate  than  Hively's  (Hively,  Patterson,  &  Page 
1968)  "domain-referenced  testing,"  Ebel's  (1962)  "content-referenced  testing," 
or  Osburn's  (1968)  "universe-defined  testing."  Swaminathan,  Hambleton,  and  Al¬ 
gina  (1974)  pointed  out  that  a  CRT  is  used  primarily  to  ascertain  a  student's 
standing  with  respect  to  a  prescribed  mastery  (pass-fail)  standard  and  hence  the 
name  CRT.  In  some  cases,  prior  normative  information  is  used  in  setting  the 
criterion  (Pinney  1979;  Popham  1978),  a  practice  that  blurs  the  distinction  be¬ 
tween  norm-  and  criterion-referenced  testing  but  that  helps  avoid  unrealisti¬ 
cally  high  or  low  criteria. 

Glass  (1978),  along  with  Burton  (1978),  Levin  (1978),  Linn  (1978a),  and 
Messick  (1975),  criticized  the  use  of  mastery  criteria  in  testing  because  such 
criteria  are  necessarily  arbitrary.  Linn  (1978a),  however,  doubts  that  educa¬ 
tors  can  easily  sidestep  the  demand  for  standards  reflected  in  opinion  polls  and 
deliberations  of  state  legislatures.  Glass  (1978)  concludes  that  measurements 
should  be  based  on  comparative  standards  of  better  or  worse  rather  than  on  arbi¬ 
trary  mastery  levels. 

In  a  well-reasoned  response  to  Glass  (1978),  Hambleton  (1978)  argued  that 
mastery  criteria  are  arbitrary,  but  in  the  best  sense  of  the  word;  they  are 
standards  reflecting  professional  judgment  and  discretion.  For  all  their 
faults,  he  argues,  such  criteria  are  still  the  best  basis  for  educational  deci¬ 
sions.  Block  (1978),  Popham  (1978),  and  Scriven  (1978)  offered  further  rebut¬ 
tals  to  the  critics  of  standard  setting,  and  Terwilliger  (1977)  discussed  the 
philosophical  issues  involved  in  setting  standards.

When  institutional  limitations  impose  constraints  on  the  number  of  examinees
who  can  be  selected,  these  constraints  often  imply  a  normative  approach.
For  instance,  if  a  college  has  facilities  for  only  500  new  freshmen  and  wants
the  500  that  are  best  qualified,  then  selection  will  necessarily  be  based  on 
normative  comparisons  between  applicants.  Often,  however,  no  such  constraints 
exist.  For  instance,  nothing  typically  constrains  the  number  of  students  re¬ 
ceiving  a  passing  course  grade,  the  number  of  students  moved  to  the  next  level 
of  instruction  in  computer-aided  instruction,  the  number  of  drivers'  licenses 
issued  by  a  state,  or  the  number  of  professionals  licensed  by  a  state  for  a  giv¬ 
en  field.  In  situations  where  no  institutional  constraints  operate,  the  use  of 
mastery  criteria  is  commonplace,  sensible,  and  not  overly  controversial.

The  controversy  over  standards  centers  primarily  around  their  use  in  large- 
scale  statewide  or  districtwide  assessments  to  determine  who  will  and  will  not 
be  given  diplomas.  Because  any  adopted  standard  can  be  used  either  as  a  guide¬ 
line  in  evaluating  student  performance  or  as  a  tool  for  assigning  diplomas,  the 
standards  debate  should  center  not  only  around  the  question  of  whether  standards 
will  be  used  but  also  around  the  question  of  how  they  will  be  used. 

Technical  Issues 


A  number  of  technical  issues  have  received  minor  attention.  Wilcox  (1976, 
1979a)  describes  methods  of  deciding  the  optimal  length  for  a  CRT.  Both  Kings¬ 
bury  and  Weiss  (1979)  and  Spineti  and  Hambleton  (1977)  present  computerized 
adaptive  CRT  strategies,  which  can  reduce  the  number  of  items  needed  to  make 
mastery  decisions.  Van  der  Linden  (1979)  argues  that  binomial  test  models,  the 
models  on  which  much  of  CRT  theory  rests,  impose  some  unrealistic  conditions  on 
item  characteristic  curves.  Lewis,  Wang,  and  Novick  (1975),  and  Wilcox  (1978a, 
1979b,  1979c)  propose  methods  of  estimating  true  domain  scores  on  CRTs. 

Most  of  the  CRT  item  analysis  procedures  are  variations  of  conventional 
methods  (Haladyna  1974;  Lord  1977d;  Mehrens  &  Lehman  1978;  Panell  &  Laabs  1979).
For  instance,  Mehrens  and  Lehman  (1978,  p.  334)  suggest  pruning  items  on  the 
basis  of  a  discrimination  index  reflecting  the  difference  between  the  item's 
difficulty  in  a  pre-  and  post-instructional  group.  Several  authors  argue,  how¬ 
ever,  that  pruning  items  on  the  basis  of  difficulty  or  discrimination  indices 
violates  the  very  concept  of  a  CRT,  defined  as  a  test  designed  to  assess  an  in¬ 
dividual's  status  in  a  well-defined  behavior  domain  (Levine  1976;  Osburn  1968; 
Shoemaker  1974).  Kwansa  (1974)  found  that  after  items  were  pruned  on  the  basis 
of  conventional  item  statistics,  the  items  remaining  were  not  representative  of 
the  original  domain.  With  the  exception  of  Rovinelli  and  Hambleton  (1977),  the 
CRT  critics  of  conventional  item  development  procedures  have  been  too  glib  about 
the  problems  in  selecting  items  to  be  representative  of  a  domain,  problems  exac¬ 
erbated  by  the  fact  that  the  domain  is  often  so  vaguely  defined. 

One  of  the  most  frequently  discussed  problems  in  CRT  is  that  of  setting  the 
criterion.  Meskauskas  (1976)  reviews  the  suggested  methods,  covering  several 
papers  that  appeared  before  the  period  of  this  review.  Methods  of  setting  the 
criterion  that  maximize  subjective  expected  utility  functions  under  various  sets 
of  assumptions  have  been  proposed  by  Macready  and  Dayton  (1977),  Huynh  (1976b), 
Huynh  and  Perney  (1979)  and  Wilcox  (1979d).  Swaminathan,  Hambleton,  and  Algina
(1975),  and  Wilcox  (1977,  1979f)  discuss  a  related  problem,  that  of  estimating
the  probability  of  false  negative  and  false  positive  errors  in  making  mastery 
decisions.  Some  decision  makers,  however,  may  feel  uncomfortable  making  the 
utility  judgments  that  these  criterion-setting  methods  require.  Further,  these 
methods  for  setting  an  observed  mastery  score  require  that  there  already  exist  a 
set  criterion  score  either  on  a  referral  task  or  in  terms  of  true  scores,  a  re¬ 
quirement  that  begs  the  fundamental  question  in  most  cases. 

Because  criterion-referenced  tests  need  not  have  variance,  Millman  and  Popham
(1974)  argue  that  classical  reliability  statistics  are  not  appropriate  measures
of  test  precision.  Like  Woodson  (1974),  we  are  left  puzzled.  If  a  major
purpose  of  a  CRT  is  to  classify  examinees  into  mastery  states  (Swaminathan  et 
al.  1974),  then  what  purpose  could  be  served  by  a  test  on  which  every  examinee
in  the  population  of  examinees  for  which  the  test  was  intended  would  obtain  the 
same  score? 

Whether  prompted  by  Millman  and  Popham  (1974)  or  not,  considerable  effort 
has  been  expended  to  find  alternatives  to  classical  reliability  for  CRT.  Sever¬ 
al  suggestions  are  based  on  a  redefinition  of  test  variance  as  variance  around 
the  criterion  score,  rather  than  variance  around  the  mean  score.  Livingston 
(1972)  redefined  reliability  as  the  ratio  of  true  score  variance  about  the  cri¬ 
terion  to  observed  score  variance  about  the  criterion.  Brennan  and  Kane  (1977) 
and  Lovett  (1977)  develop  analysis  of  variance  estimates  of  generalizability  and 
reliability  coefficients  that  are  extensions  of  Livingston's  (1972)  notions. 
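
Livingston's ratio can be computed from classical quantities, as the sketch below shows. It relies on two classical identities: true-score variance equals reliability times observed-score variance, and true and observed scores share the same mean, so both variances about the criterion pick up the same squared distance between mean and criterion. The function name, argument names, and numbers are illustrative assumptions.

```python
def livingston_coefficient(reliability, variance, mean, criterion):
    """True-score variance about the criterion over observed-score variance about it."""
    shift = (mean - criterion) ** 2
    true_about_criterion = reliability * variance + shift
    observed_about_criterion = variance + shift
    return true_about_criterion / observed_about_criterion


# Hypothetical test: classical reliability .80, observed variance 25, mean 30.
at_mean = livingston_coefficient(0.80, 25.0, 30.0, criterion=30.0)
off_mean = livingston_coefficient(0.80, 25.0, 30.0, criterion=35.0)
```

When the criterion sits at the mean the coefficient equals the classical reliability; moving the criterion five points away raises it to .90 for the same test and the same examinees.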

Beginning  with  Swaminathan  et  al.  (1974),  a  number  of  coefficients  of  precision
for  CRT  based  on  decision  concepts  have  been  proposed.  Swaminathan  et
al.  proposed  Cohen's  (1960)  kappa,  which  measures  the  consistency  of  decisions
on  two  parallel  tests.  This  index  can  be  estimated  directly  only  if  there  are 
two  test  administrations.  Huynh  (1976a,  1979),  Marshall  and  Haertel  (1976), 
Strasler  and  Raeth  (1977),  and  Subkoviak  (1976)  discussed  methods  of  estimating 
kappa  from  a  single  test  administration.  Algina  and  Noe  (1978)  and  Subkoviak 
(1978)  critically  evaluated  some  of  these  single  administration  estimates. 
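
For the two-administration case, kappa can be computed directly from the cross-classification of mastery decisions on the two parallel tests. The sketch below uses the standard definition, observed agreement corrected for the agreement expected from the marginal proportions; the table of counts is hypothetical.

```python
def cohen_kappa(table):
    """table[i][j] = number of examinees classified i on form A and j on form B."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_marginals = [sum(table[i]) / n for i in range(k)]
    col_marginals = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected if the two classifications were independent.
    p_chance = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical decisions for 100 examinees; rows are form A, columns are form B,
# and the order of categories is (master, nonmaster).
decisions = [[40, 10],
             [5, 45]]
kappa = cohen_kappa(decisions)
```

Here 85% of the decisions agree, 50% agreement would be expected by chance, and kappa is therefore .70.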

Beginning  with  van  der  Linden  and  Mellenbergh  (1978),  the  literature  on 
decision-based  coefficients  of  test  precision  starts  winding  toward  a  surprising 
conclusion.  Mellenbergh  and  van  der  Linden  (1979)  argue  that  tests  should  be 
evaluated,  not  on  the  consistency  of  decisions  across  two  occasions,  but  on  the 
consistency  between  decisions  based  on  the  test  and  decisions  which  would  be 
made  if  the  true  scores  were  known.  With  this  consideration  in  mind,  van  der 
Linden  and  Mellenbergh  (1978),  Mellenbergh  and  van  der  Linden  (1979),  and  Wilcox 
(1978b)  propose  a  rescaling  of  Bayes  risk  as  a  decision-theoretic  index  of  test 
quality.  Bayes  risk  is  the  expected  value  of  (decision)  losses  with  respect  to 
the  joint  distribution  of  random  variables  T  (true  score)  and  X  (observed  score)
in  a  given  population.  Another  decision-theoretic  proposal  is  offered  by  Liv¬ 
ingston  and  Wingersky  (1979).  Seemingly  to  their  surprise  and  ours,  van  der 
Linden  and  Mellenbergh  (1978)  manage  to  show  that  their  rescaling  of  Bayes  risk 
is  equal  to  the  classical  reliability  coefficient  if  a  linear  or  squared  error 
loss  function  is  assumed  and  if  a  linear  regression  of  true  on  observed  scores 
is  assumed. 

As  van  der  Linden  and  Mellenbergh  (1978)  suggest,  a  measure  of  test  preci¬ 
sion  should  reflect  the  correspondence  between  decisions  reached  using  true  and 
observed  scores.  Coefficient  kappa  does  not  do  so.  Kappa  and  coefficients  de¬ 
rived  from  Livingston's  work  are  highly  situation  specific,  because  they  can 
vary  a  great  deal  depending  on  where  the  user  sets  the  criterion.  If  the  pre¬ 
mise  is  accepted  that  a  useful  CRT  must  have  nonzero  variance  in  the  population 
for  which  it  is  intended,  then  the  classical  reliability  coefficient  may  be  a 
suitable  index  for  CRT  after  all.  It  has  a  decision-theoretic  interpretation, 
it  does  not  depend  on  where  the  criterion  is  set,  and  it  is  readily  understood 
by  many  users.  The  standard  error  of  measurement  is  also  a  useful  index  of  CRT 
precision  because  it  can  be  used  to  estimate  the  probability  of  misclassifying
an  examinee  for  any  desired  criterion  level. 
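
One way the standard error of measurement might be used for this purpose is sketched below. The standard error is computed from the classical formula, and the misclassification probability additionally assumes that errors of measurement are normally distributed with the same standard error at every score level; that assumption, the function names, and the numbers are illustrative rather than drawn from the sources reviewed.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: observed-score SD times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def prob_observed_below(true_score, criterion, sem):
    """P(observed score < criterion | true score), assuming normal errors."""
    return NormalDist(mu=true_score, sigma=sem).cdf(criterion)


# Hypothetical test with SD 5 and reliability .84, so the SEM is 2 points.
sem = standard_error_of_measurement(sd=5.0, reliability=0.84)

# Chance that a true master (true score 32) falls below a criterion of 30.
p_false_negative = prob_observed_below(true_score=32.0, criterion=30.0, sem=sem)
```

Because the criterion enters only as an argument, the same standard error serves for any cutoff the user chooses, which is the property noted in the text.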

Forsyth  (1976),  Pandey  and  Shoemaker  (1975),  and  Raju  (1977a)  discuss  multiple-matrix
sampling  techniques  that  may  be  used  in  CRT.  Myerberg  (1979)  found
that  stratifying  items  by  difficulty  and  content  did  not  improve  estimates  of 
mean  test  scores  in  a  multiple  matrix  sampling  design. 

CRTs  have  been  well  represented  in  classrooms,  if  not  in  measurement  texts, 
for  as  long  as  there  has  been  education.  CRT  will  continue  to  endure  in  the 
form  of  classroom  exams,  licensing  exams,  and  tools  of  computer-aided  instruc¬ 
tion.  Many  of  the  recently  developed  psychometric  methods  for  CRT  may  not  en¬ 
dure  so  long.  For  example,  because  they  require  a  previously  set  criterion 
score  on  a  referral  task  or  on  the  true  score  continuum,  the  methods  of  setting 
the  criterion  that  are  based  on  subjective  expected  utility  theory  largely  beg 
the  question.  Although  kappa  is  gaining  in  popularity  as  a  measure  of  test  stability,
it  will  not  soon  supplant  standard  variance-based  indices  of  test  precision.
Sophisticated  CRT  methods  of  estimating  true  scores  and  setting  test
length  can  be  expected  to  receive  no  more  use  than  they  have  received  in  more 
conventional  testing. 


Latent  Trait  Test  Theory 


Latent trait test theories have their roots in work in the 1940s by Mosier,
Guttman, and Lazarsfeld, among others. Although the basic ideas were known
about 40 years ago, the methods could not be successfully applied until
high-speed computing equipment was available to psychometricians for research
and applications. As this equipment became more widely available during the
early 1960s and could be used to solve some of the problems of latent trait
test theories, models were further developed and techniques further refined.
A second barrier to the application of these techniques was their
sophisticated mathematical requirements, which kept a number of
psychometrically oriented researchers from thoroughly understanding the
methods.

Latent trait test theories have been applied and developed under several
rubrics. The best known are item characteristic curve theory and, more
recently, item response theory (IRT). The latter term is used here because it
emphasizes the psychologically based nature of these theories.

In  an  attempt  to  make  IRT  more  useful  and  more  widely  understood  by  practi¬ 
tioners,  several  articles  during  the  review  period  have  provided  a  basic  intro¬ 
duction  to  IRT.  Hambleton  and  Cook  (1977),  in  their  introductory  article  to  a 
very  useful  special  issue  of  the  Journal  of  Educational  Measurement,  provided  a 
brief  introduction  to  IRT.  A  more  comprehensive  and  relatively  nontechnical 
review  for  the  uninitiated  was  provided  by  Hambleton,  Swaminathan,  Cook,  Eignor, 
and  Gifford  (1978).  Marco  (1977)  gives  practical  examples  of  applications  of 
IRT,  as  suggested  by  Lord  (1977c);  these  include  practical  examples  of  designing 
a  multi-purpose  test  using  the  information  curves  of  IRT,  evaluating  a  multi¬ 
level  test,  and  equating  tests  on  the  basis  of  pre-test  statistics. 

IRT  models  are  usually  differentiated  by  the  number  of  parameters  estimated 
for  the  items  and  the  nature  of  the  item  characteristic  curve  or  item  response 
function  (IRF).  IRFs  are  usually  assumed  to  be  either  normal  or  logistic 
ogives.  Since  there  is  a  high  degree  of  similarity  between  the  two  (although 
some  differences  in  practical  applications;  see  Kingsbury  &  Weiss  1979b),  that 
distinction  will  be  ignored  here,  and  the  logistic  ogive  will  be  assumed.  Thus, 
the basic differentiations among the models are in the number of parameters
necessary to describe the shape and location of the IRF. A special case of general
IRT, the 1-parameter logistic model, is also known as the Rasch model, having
been independently developed by the Danish mathematician Georg Rasch. In this
model the test items are described only in terms of their difficulties, and their
discriminations are assumed to be equal. The usual 2-parameter model (as will be
seen, there is a special 2-parameter case of the Rasch model) describes items both
by their difficulties and by their discriminations; and the 3-parameter model adds
a chance level or a pseudo-guessing parameter as the third descriptor of the
IRF.
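
The three models just described can be illustrated with a single function. The
following is a minimal sketch only: the function name irf and the parameter
names theta, b, a, and c are illustrative assumptions, and the plain logistic
form is used without the scaling constant that some authors include to
approximate the normal ogive.

```python
import math


def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRF.

    theta: trait level; b: item difficulty; a: item discrimination;
    c: lower asymptote (pseudo-guessing). All names are illustrative.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# 1-parameter case: only the difficulty b varies across items.
p_1pl = irf(0.5, b=0.0)
# 2-parameter case: difficulty and discrimination vary.
p_2pl = irf(0.5, b=0.0, a=1.5)
# 3-parameter case: a nonzero lower asymptote is added.
p_3pl = irf(0.5, b=0.0, a=1.5, c=0.2)
```

Setting a to a common value and c to zero recovers the 1-parameter case, which
is why the 2- and 3-parameter models can be treated as generalizations of it.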

1-Parameter Logistic Model

Estimation, model fit, and equating. The 1-parameter logistic (1PL) model
has generated a substantial amount of research during the review period. As is
characteristic of research on IRT models, much of the basic research has
focused on problems of item parameter estimation. Since 1PL parameter
estimation involves estimating only the difficulty parameters for items along
with the ability parameters for individuals, these two sets of parameters are
usually estimated simultaneously. However, because of some mathematical
problems, they can only be approximated under certain circumstances: Cohen
(1979) provides noniterative procedures for estimating ability and difficulty
that give values similar to those of the maximum likelihood procedures usually
used for this process. Wright and Douglas (1977a, 1977b) compare different
procedures for estimating these parameters, as do Andersen and Madsen (1977);
and Andersen (1977) verifies that the number-correct score is a minimal
sufficient statistic in M-C tests for estimating trait levels. He also
demonstrates that the number-correct score in the 1PL model is not a function
of the item difficulties used in the test. Kearns and Meredith (1975) provide
Bayesian procedures for point estimates of 1PL model scores. Their procedure is
an empirical Bayes procedure, which, like all such procedures, is sample
dependent and efficient only with large sample sizes.
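
The sufficiency of the number-correct score can be seen in a small numerical
sketch. The code below assumes that item difficulties are already known and
estimates only the trait level by Newton-Raphson iteration; the function name
rasch_ability_mle and its arguments are hypothetical and do not reproduce any
of the cited procedures.

```python
import math


def rasch_ability_mle(responses, difficulties, tol=1e-8, max_iter=50):
    """Maximum likelihood trait estimate under the 1PL model, items known."""
    r = sum(responses)
    n = len(responses)
    if r == 0 or r == n:
        raise ValueError("no finite estimate for zero or perfect scores")
    theta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        # The first derivative of the log-likelihood uses the responses
        # only through the number-correct score r.
        score = r - sum(probs)
        info = sum(p * (1.0 - p) for p in probs)
        step = score / info
        theta += step
        if abs(step) < tol:
            break
    return theta


difficulties = [-1.0, -0.5, 0.5, 1.0]
est_pattern_1 = rasch_ability_mle([1, 1, 0, 0], difficulties)
est_pattern_2 = rasch_ability_mle([0, 0, 1, 1], difficulties)
```

The two response patterns differ in which items were answered correctly, but
both have a number-correct score of 2 and therefore yield the same estimate.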

The  problem  of  parameter  estimation  in  these  models  relates  to  the  question 
of  fit  of  the  models  to  data.  One  important  feature  of  the  IRT  models,  and  par¬ 
ticularly  the  1PL  model,  is  that  procedures  are  available  for  testing  the  fit  of 
data  to  the  models.  If  items  do  not  fit  the  model,  they  can  be  eliminated  and  a 
set  of  items  can  be  identified  that  do  fit  the  model.  When  a  set  of  model-fit¬ 
ting  items  is  identified,  they  permit  the  use  of  number-correct  score  as  an  in¬ 
dicator  of  trait  level  in  the  1PL  model.  However,  there  has  been  some  question 
about  the  utility  of  tests  that  do  fit  the  1PL  model. 

Wood  (1978)  fit  the  1PL  model  to  simulated  coin  tosses  of  500  subjects  on 
50  variables,  a  reasonable  ratio  of  subjects  to  variables  to  adequately  estimate 
the  parameters  of  the  1PL  model.  He  found  that  he  was  unable  to  reject  47  of 
the 50 items for lack of fit at the 95% confidence level of the chi-square test
usually  used  to  test  the  fit  of  items  to  the  model.  Thus,  his  data  suggest  that 
random  data  would  not  show  substantial  nonfit  to  the  model.  His  data,  however, 
indicated  that  the  discriminations  of  the  items  were  low  and  that  the  ability 
estimates  were  essentially  the  same  for  all  his  simulees.  His  conclusion,  how¬ 
ever,  is  that  a  demonstration  of  lack  of  nonfit  by  itself  is  not  good  enough, 
since  most  of  his  randomly  derived  items  fit  the  model. 

The  problem  here,  of  course,  is  that  on  the  basis  of  a  lack  of  nonfit,  some 
users of the 1PL model would be tempted to conclude that the model does fit the
data; item discriminations would be set at 1 and use of the model would
continue. However, Wood's data suggest that this would be inappropriate, since the
true  item  discriminations  were  very  low  and  setting  them  equal  to  1  would  result 
in  inappropriate  discrimination  values  and  inappropriate  indications  of  the  er¬ 
ror  of  measurement  for  the  testees.  Thus,  additional  research  is  indicated  on 
methods  for  testing  fit  of  data  to  the  1PL  model. 

One  of  the  major  advantages  of  IRT  models,  including  the  1PL  model,  is  the 
promise  of  being  able  to  measure  individuals  on  the  same  ability  scale,  regard¬ 
less  of  the  difficulty  of  the  subset  of  items  on  which  they  are  measured.  This 
invariance  of  ability  estimates  over  item  subsets  implies  the  capability  of  IRT 
models  to  equate  measurements  from  different  tests,  a  problem  that  is  not  ade¬ 
quately  solvable  with  classical  test  theory.  Thus,  the  usefulness  of  the  1PL 
model  for  vertical  equating  has  been  investigated  by  Slinde  and  Linn  (1977b). 
Their  data  suggest  problems  in  the  use  of  the  1PL  model  for  vertical  equating, 
since  they  found  mean  differences  in  ability  estimates  based  on  high  or  low 
ability  calibrations  in  their  cross-validation  groups,  with  greater  differences 
for  ability  levels  in  the  calibration  groups  that  were  farther  apart.  They  sug¬ 
gest  that  perhaps  more  item  parameters  are  necessary  to  do  a  good  job  of  verti¬ 
cal  equating. 

Gustafsson (1979) suggested that the procedure behind the Slinde and Linn
(1979a) results produced a spurious lack of fit to the 1PL model: selecting
levels of examinee performance on subsets of items only introduces a
regression effect that could have caused the obtained results. He does admit,
however, that there may be problems in vertical equating in the 1PL model if
guessing exists, since this would introduce a correlation between item
difficulty and item discrimination. Slinde
and  Linn  (1979a)  analyzed  their  data  in  an  attempt  to  eliminate  the  regression 
effect  suggested  by  Gustafsson  (1979),  using  a  different  data  set.  Their  re¬ 
sults  supported  their  earlier  data  indicating  problems  in  vertical  equating  with 
the  1PL  model  since  the  item  parameters  estimated  on  high  and  low  groups  result¬ 
ed  in  different  mean  ability  estimates.  The  problems  were  mainly  characteristic 
of  the  low  ability  group  and  therefore  may  be  due  to  guessing,  which  is  support¬ 
ed  by  a  negative  correlation  between  the  difficulties  and  the  discriminations  in 
their  low  ability  group.  They  do,  however,  concede  that  the  1PL  model  may  be 
useful  for  equating  in  less  extreme  situations  than  used  in  their  data.  This  is 
confirmed  by  their  later  results  (Slinde  &  Linn  1979b),  which  support  the  use  of 
the  1PL  model  in  vertical  equating  for  relatively  contiguous  ability  levels  but 
not  for  those  which  were  further  apart.  Rentz  and  Bashaw  (1977)  also  illustrate 
the  use  of  the  1PL  model  for  equating  using  a  linking  test  of  common  items. 
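
Equating through a linking test of common items can be sketched as a simple
shift of scale. The example below assumes that the same common items have been
calibrated separately on two tests, so that under the 1PL model the two sets of
difficulty estimates should differ only by a constant; the function names and
the numerical values are hypothetical and are not taken from the cited studies.

```python
def link_shift(b_common_on_a, b_common_on_b):
    """Constant that moves values from test B's scale onto test A's scale."""
    if not b_common_on_a or len(b_common_on_a) != len(b_common_on_b):
        raise ValueError("need the same nonempty set of common items")
    n = len(b_common_on_a)
    return sum(b_common_on_a) / n - sum(b_common_on_b) / n


def to_scale_a(values_on_b, shift):
    """Apply the linking constant to difficulties or abilities from test B."""
    return [v + shift for v in values_on_b]


# Hypothetical difficulties of two common items from each calibration.
shift = link_shift([0.0, 1.0], [-0.5, 0.5])
abilities_on_a = to_scale_a([1.0, -0.25], shift)
```

If the differences between the two calibrations of the common items vary widely
rather than clustering around a single constant, that variation is itself
evidence of the kind of equating problem discussed above.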

Person-free/sample-free measurement. Because the 1PL IRT model promises
measurement  that  is  free  of  the  influence  of  a  specific  group  of  testees  or  a 
specific  subset  of  test  items,  in  contrast  to  the  sample-specific  measurement  of 
CTT,  considerable  research  continues  on  these  capabilities  of  the  model,  inde¬ 
pendent  of  the  equating  problem.  Tinsley  and  Dawis  (1975)  found  the  1PL  easi¬ 
ness  parameters  and  ability  estimates  to  be  invariant  over  samples  of  testees 
differing  in  ability  level  for  tests  of  25  or  more  items.  However,  their  re¬ 
sults  indicated  that  the  easiness  parameter  estimates  were  no  more  invariant 
than the transformed proportion-correct difficulty values. Their data (Tinsley
& Dawis 1977) also support the test-free characteristic of the 1PL model in
that  ability  estimates  for  individuals  did  not  differ  substantially  when  they 
were based on item subsets of different difficulty levels selected from the same
tests.  Their  data  also  indicate  that  the  1PL  model  fit  M-C  items  (but  recall 
Wood's,  1978,  study  of  the  chi-square  test  of  fit)  after  testees  with  scores  in 
the guessing range (10% to 15% of the sample) were eliminated.

Dinero  and  Haertl  (1977)  studied  the  applicability  of  the  1PL  model  when 
item discriminations varied, generating testee response data from the
3-parameter model and then fitting the 1PL model. The results indicated that when the
distribution of item discriminations was uniform, the 1PL model did not fit the
data,  but  when  there  were  substantial  numbers  of  items  with  similar  discrimina¬ 
tions  (normal  or  skewed  distributions  of  discriminations),  the  fit  of  the  model 
was  good.  Again,  these  fit  studies  should  be  interpreted  with  regard  to  Wood's 
(1978)  study  of  the  test  of  fit.  Whitely  and  Dawis  (1976)  found  1PL  item  diffi¬ 
culty  parameter  estimates  to  differ  as  a  function  of  test  context,  thereby  ques¬ 
tioning  the  invariance  characteristics  of  the  1PL  model.  Whitely  (1977),  in 
response  to  the  paper  by  Wright  (1977a),  which  gives  some  insights  into  some 
aspects  of  the  1PL  model,  agrees  with  Wood's  (1978)  later  conclusion  that  the 
chi-square  test  of  the  fit  of  the  model  has  little  power  for  small  samples  and 
does  not  do  well  for  sample  sizes  even  up  to  800. 

Thus,  the  data  on  the  robustness  of  the  1PL  model  are  equivocal,  since  some 
studies  support  invariance  of  ability  estimates  and  item  parameter  estimates 
over  item  and  person  sampling,  whereas  others  suggest  that  the  invariance  may 
not  be  as  great  as  promised  by  the  model.  The  interpretation  of  these  results, 
however,  is  clouded  by  the  problems  of  determining  the  fit  of  the  data  to  the 
model,  and  substantial  additional  research  is  necessary  on  this  issue  before 
questions  of  the  invariance  of  the  model  can  be  adequately  investigated. 

Other developments. Because the 1PL model has frequently been used with
existing  M-C  tests,  Keats  (1974)  developed  a  1PL  model  with  guessing.  White 
(1976)  derives  this  model  from  a  CTT  approach,  but  Colonius  (1977)  indicates 
that  Keats'  model  results  in  no  consistent  maximum  likelihood  estimate  for  its 
parameters. 

One  advantage  of  IRT  models  is  their  ability  to  generalize  beyond  the  bina¬ 
ry  responses  commonly  obtained  from  M-C  tests  to  take  into  account  information 
in incorrect responses to test items. Andersen (1977) and Andrich (1978a,
1978b) discussed generalizations of the 1PL model to polychotomous items, which
result  in  a  successive  integers  scoring  technique.  Douglas  (1978)  develops  es¬ 
timation  procedures  for  Andrich's  model.  One  important  characteristic  of  these 
models  is  that  like  the  dichotomous  case  of  the  1PL  model,  integer  scoring  using 
equally distant weights preserves the 1PL model characteristics. As a
consequence, complex scoring procedures, such as are characteristic of the other
IRT models, are not required.

2-  and  3-Parameter  IRT  Models 


The  2-parameter  (2P)  and  3-parameter  (3P)  IRT  models  are  simply  generaliza¬ 
tions  of  the  1PL  model,  including  additional  parameters  that  describe  aspects  of 
the  IRF.  The  2P  model  permits  items  to  vary  in  the  discrimination  parameter, 
and  the  3P  model  adds  the  lower  asymptote  (pseudo-chance  value)  to  the  IRF. 

Because they are generalizations of the 1PL model, these models have essentially
the same applications and utility. That is, they have the capability of providing
sample-free measures of individuals, resulting in the same degree of
"objectivity" as does the 1PL model. These IRT models also permit the measurement
of individuals with any subset of items, although the number-correct score for
these models does not convey the same information as it does for the 1PL model.
Consequently, new scoring methods have been developed to implement these models, as
have  additional  methods  for  the  estimation  of  item  parameters  (Urry  1976;  Wood, 
Wingersky,  &  Lord  1976). 

One  major  advantage  of  IRT  models  is  that  the  concept  of  reliability  is  not 
emphasized.  The  consequence  is  that  all  the  confusion  that  has  been  engendered 
with  regard  to  this  concept  in  classical  test  theory  disappears;  and  issues  of 
homogeneity,  internal  consistency,  type  of  reliability  coefficient,  and  lower 
bounds  are  eliminated.  In  place  of  reliability,  IRT  uses  the  concept  of  infor¬ 
mation  or  precision  of  measurement,  which  is  related  to  the  standard  error  of 
measurement  (or  estimate)  for  a  given  level  of  a  trait.  Consequently,  IRT  per¬ 
mits  the  error  of  measurement  to  vary  as  a  function  of  the  variable  being  mea¬ 
sured,  and  information  and  its  derivatives  (the  conditional  standard  error  of 
measurement  or  estimate)  index  this  change  in  precision  of  measurement  as  a 
function  of  the  trait  being  measured. 
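
The way information indexes precision at each trait level can be made concrete
with a short sketch. It assumes the logistic 3-parameter form without a scaling
constant, for which item information is a^2 (Q/P) [(P - c)/(1 - c)]^2, and it
takes the conditional standard error as the reciprocal square root of the
summed item information; the function names and the item values are
illustrative assumptions.

```python
import math


def item_information(theta, a, b, c=0.0):
    """Information contributed by one item at trait level theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2


def conditional_sem(theta, items):
    """Conditional standard error for a test given (a, b, c) item tuples."""
    total = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(total)


# Hypothetical four-item test with identical items centered at theta = 0.
test_items = [(1.0, 0.0, 0.0)] * 4
sem_center = conditional_sem(0.0, test_items)
sem_extreme = conditional_sem(3.0, test_items)
```

In this sketch the standard error is smallest where the items are located and
grows for trait levels far from the item difficulties, which is the variation
in precision that a single reliability coefficient cannot express.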

Samejima  (1977a,  1977b)  differentiates  various  aspects  of  the  information 
function and provides critiques of the concept of reliability. She also
develops the concept of weakly parallel tests (Samejima, 1977a), which are tests that
have  similar  information  functions  but  do  not  require  the  number  of  items,  score 
categories,  or  other  aspects  of  the  tests  to  be  similar.  This  redefinition  of 
parallel  tests  permits  not  only  the  easier  design  of  parallel  tests  for  applied 
purposes  but  the  conceptual  definition  of  parallel  adaptive/tailored  tests. 
Samejima  also  provides  criticisms  of  the  classical  standard  error  of  measure¬ 
ment,  indicating  its  group  dependency  (whereas  the  standard  errors  of  measure¬ 
ment  of  IRT  are  not  group  dependent)  and  its  dependence  upon  the  heterogeneity 
of  the  group  with  regard  to  the  trait  being  measured. 

Parameter estimation and equating. As with the 1PL IRT model, an important
problem  is  the  development  of  accurate  methods  for  estimating  the  parameters  of 
test  items.  This  is  somewhat  more  complex  in  the  2P  and  3P  models,  since  the 
problem  becomes  one  of  simultaneously  estimating  two  or  three  parameters  for 
each  item  plus  an  ability  (trait)  parameter  for  each  person  in  the  item  calibra¬ 
tion  sample.  Jensema  (1976)  proposes  a  direct  conversion  method  for  estimating 
IRT  parameters  from  the  item  parameters  of  CTT.  Schmidt  (1977)  evaluates  a 
graphical  method  of  direct  conversion  proposed  earlier  by  Urry  (1976).  Ree 
(1979)  compares  four  methods  of  estimating  IRF  parameters  and  concludes  that  no 
one  of  the  procedures  was  consistently  best,  since  the  results  obtained  depended 
upon  the  characteristics  of  the  data,  while  Waller  (1980)  studied  yet  another 
item  parameterization  approach  under  conditions  of  nonsymmetric  distributions  of 
ability.  Samejima  (1977a)  describes  a  method  of  estimating  the  parameters  of 
IRFs  when  previous  estimates  of  ability  are  available  for  a  group  of  individu¬ 
als. 


As with the 1PL model, there have been several studies of the robustness
of the 2P and 3P models under a variety of conditions. Ree and Jensen (1980)
studied the effects of errors in item parameters on linear equating, while
Hambleton and Cook (1980) studied the robustness of the models under a variety
of conditions, as well as the effects of test length and sample size on
estimates of the precision of latent trait ability scores. Reckase (1979)
examined the effects of multidimensionality in an item pool on item parameter
estimates obtained in the 1PL versus 3P models, and Lord (1975c) solved an
empirical problem of the correlation between difficulty and discrimination
parameters by redefining the ability scale onto a different metric. All these
studies assist in gaining a better understanding of the potential of IRT models
to perform adequately under a variety of situations.

The  3P  models  have  also  been  applied  to  the  problem  of  score  equating  and 
linking  of  items  into  larger  pools.  Marco,  Petersen,  and  Stewart  (1980)  exam¬ 
ined  the  adequacy  of  IRT  score  equating  models  when  sample  and  test  characteris¬ 
tics  are  systematically  varied;  Yen  (1980)  studied  the  effects  of  context  on 
item  parameter  and  trait  estimates;  and  Ree  and  Jensen  (1980)  studied  the  ef¬ 
fects  of  errors  in  item  parameter  estimates  on  linear  equating. 

Applications:  Option  weighting,  adaptive/tailored  testing.  One  of  the 
problems  that  has  not  adequately  been  resolved  by  CTT  is  the  problem  of  extract¬ 
ing  additional  information  from  the  responses  of  testees  to  the  incorrect  op¬ 
tions  on  a  M-C  item.  Thissen  (1976)  addressed  this  problem  directly  using  a 
polychotomous  IRT  model  and  found  that  it  gave  one-third  to  one-half  more  infor¬ 
mation  than  did  the  dichotomous  model  applied  to  the  same  data;  an  interesting 
subsidiary  finding  was  that  the  reliability  of  the  two  models  did  not  differ 
substantially,  indicating  the  ineffectiveness  of  reliability  as  an  index  of  the 
utility  of  different  approaches  to  scoring  items.  Samejima  (1977c)  described 
another application of polychotomous latent trait IRT approaches. Bejar (1977)
applied the continuous IRT model to personality assessment and found a good fit
of the model to some of the personality data. His results, also evaluated in
terms  of  information  or  precision  of  measurement,  show  considerable  gains  by  use 
of  this  model  over  the  usual  dichotomous  model. 

Adaptive  testing  is  the  interactive  administration  of  tests  such  that  items 
are  selected  dynamically  for  each  individual  contingent  upon  the  individual's 
responses  to  previous  test  items.  Adaptive  testing  requires  immediate  scoring 
of  each  response  and  some  means  of  selecting  the  next  item  to  be  administered  on 
the  basis  of  response  information  and/or  ability  estimates  determined  for  each 
individual on an item-by-item basis. Although adaptive testing does not require
IRT  (Brooks  &  Hartz  1978;  Hornke  &  Sauter  1980;  Vale  &  Weiss  1975;  Waters  1977; 
Weiss  1974),  IRT  has  facilitated  the  development  and  implementation  of  most 
adaptive  testing  strategies.  The  review  period  has  seen  considerable  progress 
on  adaptive  testing  and  the  implementation  of  the  2P  and  3P  IRT  models.  Two 
major  conferences  have  provided  a  forum  for  the  discussion  of  current  research 
in  this  field  (Weiss  1978,  1980),  while  others  (Jensema  1977;  Lord  1977a;  Mc¬ 
Bride  1977;  Urry  1977)  have  pursued  basic  and  applied  research  on  the  develop¬ 
ment  and  evaluation  of  a  variety  of  adaptive  testing  strategies.  These  studies 
show,  in  general,  that  IRT  combined  with  adaptive  testing  techniques  is  a  viable 
methodology for the improvement of tests of ability and achievement and has
considerable promise for the replacement of paper-and-pencil tests with
computer-administered adaptive tests in the foreseeable future.
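
The item selection step of an adaptive test can be sketched as follows. The
example assumes one common strategy, choosing the unused item with the greatest
information at the current ability estimate under a 2-parameter logistic model;
it is not presented as the strategy of any of the cited studies, and the
function names and the item pool are hypothetical.

```python
import math


def information_2pl(theta, a, b):
    """Item information under a 2-parameter logistic model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(theta, pool, administered):
    """Index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    if not candidates:
        return None  # the pool is exhausted
    return max(candidates, key=lambda i: information_2pl(theta, *pool[i]))


# Hypothetical pool of (a, b) pairs: an easy, a medium, and a hard item.
pool = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
first = next_item(0.1, pool, set())
second = next_item(0.1, pool, {first})
```

In a complete procedure the ability estimate would be updated after each
response and the selection repeated until a stopping rule is met, for example a
target conditional standard error.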

As  might  be  expected,  a  few  studies  have  been  concerned  with  comparisons  of 
IRT  and  CTT  approaches.  Douglas,  Kahalil,  and  Farber  (1979)  compared  CTT  and 
IRT  item  analysis  procedures  by  selecting  items  using  traditional  proportion- 
correct  and  item-total  biserial  correlations  versus  item  selection  based  on  1PL 
procedures. Their data show that about half the items were selected in common
by the two procedures, some were selected by neither, and some by only one of the two. The
correlation  of  proportion  correct  with  1PL  difficulty  was  .997,  while  1PL  abili¬ 
ty  estimates  correlated  .91  with  total  score  and  .81  with  score  on  the  items 
selected  by  the  CTT  item  selection  procedure.  In  terms  of  validity,  neither 
correlated  differently  with  the  criterion  score.  Their  conclusion  was  that  the 
two procedures define "different constructs," but there were no data to indicate
which  more  adequately  defined  the  trait  desired.  Lord  (1977b)  compared  an  IRT 
approach  with  three  other  approaches  in  the  evaluation  of  the  optimal  number  of 
choices  in  a  test  item.  His  results  show  that  decreasing  the  number  of  choices 
per  item,  while  lengthening  the  test  proportionately,  decreases  the  efficiency 
for  low  ability  testees  and  increases  the  efficiency  for  high  ability  testees; 
his  data  also  show  that  reliability  comparisons  of  the  methods  do  not  demon¬ 
strate  differences,  whereas  comparisons  in  terms  of  information  (efficiency) 
describe  differences  in  the  characteristics  of  the  different  items. 

Relationships  with  other  psychometric  models.  One  of  the  potentially  most 
valuable  contributions  of  IRT  to  psychological  measurement  is  reflected  in  a 
series  of  papers  relating  the  logic  and  procedures  of  IRT  models  to  the  main¬ 
stream  of  psychological  measurement.  A  major  deficiency  of  CTT  has  been  in  the 
separation  of  its  logic  and  methodology  from  the  other  methods  of  psychological 
measurement.  The  methods  of  CTT  are  unique  to  that  approach  and  have  never  been 
demonstrated  to  derive  from  or  relate  to  any  other  models  of  psychological  mea¬ 
surement.  However,  recent  and  important  research  during  the  review  period  has 
defined  and  described  the  continuity  of  the  logic  of  IRT  approaches  with  a  vari¬ 
ety  of  other  approaches  to  psychological  measurement. 

That  IRT  approaches  are  a  special  case  of  Thurstone's  scaling  techniques  is 
well demonstrated by Lumsden (1980), Brogden (1977), and Andrich (1978d).
Wainer, Fairbank, and Hough (1978) analyzed a data set by both Thurstone scaling
methods and the 1PL model and demonstrated the similarity of the results.
Perline, Wright, and Wainer (1979) and Brogden (1977) describe relationships
between IRT models and additive conjoint measurement. Finally, an IRT model that
implements  the  standard  Likert  successive  integers  attitude  scaling  approach  has 
been  developed  (Andersen  1977;  Andrich  1978a,  1978b,  1978c;  Douglas  1978). 

Thus,  by  the  use  of  IRT  models,  researchers  can  be  assured  of  some  continuity 
between  test  theory  and  other  areas  of  psychological  scaling. 

Person  fit.  A  major  advantage  of  IRT  models  is  the  possibility  of  deter¬ 
mining  whether  a  person  (or  item)  is  performing  in  accordance  with  the  assump¬ 
tions  of  the  models.  Since  the  models  make  strong  assumptions  about  the  behav¬ 
ior  of  individuals  and  items,  it  is  necessary  to  determine  whether  both  individ¬ 
uals  and  items  fit  a  given  version  of  the  model  in  order  to  adequately  use  it. 

If  a  set  of  individuals  and  items  can  be  determined  to  be  operating  in  accor¬ 
dance  with  the  model,  strong  inferences  can  be  made  on  the  basis  of  the  data  and 
all  of  the  power  of  the  models  can  be  put  to  practical  use.  If  the  responses  of 
an  individual  (or  a  set  of  individuals  to  an  item)  do  not  fit  the  model,  it  can 
be  concluded  that  the  model  is  an  inappropriate  means  of  describing  the  behavior 
of  that  individual  on  that  set  of  items  (or  that  item  on  that  set  of  individu¬ 
als);  this  kind  of  statement  can  be  also  translated  into  a  matter  of  degree  of 
person  (or  item)  fit,  which  can  potentially  lead  to  indices  of  precision  of  mea¬ 
surement  for  a  given  individual.  Indeed,  IRT  permits  the  statement  of  the  error 
of  measurement  associated  with  a  unique  set  of  responses  of  an  individual  to  a 
set of test items. These data can also be used to study the fit of individuals
to  a  set  of  test  items  and,  hence,  to  the  assumptions  underlying  IRT. 

Most of the work on person fit has been done with the 1PL model. This work
is well described by Wright (1977b), Wright and Stone (1980), and Wainer and
Wright (1980). The approach generally used in the 1PL model involves a
chi-square test of the fit of a persons-by-items response matrix to the
probabilities predicted from the 1PL model. Lumsden (1977, 1978) generalizes the
issue to one of person reliability. He defines the person characteristic curve
(PCC), which is related to the observed data values used in the 1PL chi-square
index of
person  fit,  and  describes  how  the  IRF  and  group  reliability  are  functions  of  a 
series  of  PCCs.  The  idea  of  the  PCC  is  redefined  by  Trabin  and  Weiss  (1979)  as 
the  person  response  curve  (PRC)  to  emphasize  that  it  results  from  the  responses 
of  one  individual  to  a  set  of  test  items.  The  PRC  is  traced  back  to  work  in  the 
1940s  by  Mosier,  and  some  of  the  implications  of  it  for  the  measurement  of  per¬ 
son  fit  are  described.  Trabin  and  Weiss  derive  the  PRCs  for  a  group  of  testees 
and  proceed  to  test  the  fit  of  those  testees  to  the  3P  model.  The  results  indi¬ 
cate  an  overwhelming  fit  of  these  individuals  to  the  model,  with  the  identifica¬ 
tion  of  a  few  individuals  who  appear  to  have  systematic  lack  of  fit  for  various 
reasons.  Levine  and  Drasgow  (1980)  and  Levine  and  Rubin  (1979)  call  the  problem 
one  of  measuring  "appropriateness"  of  M-C  test  scores.  They  define  a  series  of 
appropriateness  (person  fit)  indices  and  study  the  application  of  these  indices 
via Monte Carlo simulations, in addition to real data. Their data illustrate
the  potential  of  some  of  their  indices  to  identify  lack  of  fit  of  individuals  to 
IRT  models. 
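
A chi-square style index of person fit can be sketched for the 1PL case. The
code below sums squared standardized residuals between one person's observed
responses and the model probabilities, given a trait estimate and item
difficulties that are assumed to be known; it illustrates the general idea only
and is not the exact statistic of any of the cited authors. The function name
and the values used are hypothetical.

```python
import math


def person_fit_chi_square(responses, difficulties, theta):
    """Sum of squared standardized residuals for one person under the 1PL."""
    total = 0.0
    for u, b in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        total += (u - p) ** 2 / (p * (1.0 - p))
    return total


difficulties = [-2.0, -1.0, 1.0, 2.0]
# Passes the easy items and fails the hard ones, as the model expects.
consistent = person_fit_chi_square([1, 1, 0, 0], difficulties, theta=0.0)
# Fails the easy items and passes the hard ones.
aberrant = person_fit_chi_square([0, 0, 1, 1], difficulties, theta=0.0)
```

Both patterns have the same number-correct score, yet the second produces a
much larger index, which is the kind of contrast that person fit indices are
intended to detect.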

Thus, this new area of research, which has developed during the review
period, promises to be an especially important one for future applications, since
it will permit the identification of individuals whose behavior in a testing
situation IRT does not adequately describe. The result will be statements
of individual precision for the test score of one person on a set of
items,  possibly  resulting  in  an  important  moderator  variable  to  be  used  in  pre¬ 
diction  studies  to  improve  predictive  validity. 

Order  Models 


Another  new  area  of  research  that  has  appeared  during  the  last  five  years 
is  the  application  of  order-based  models  to  the  development  of  psychological 
measuring instruments. Unlike ordinal test theory models,
these models are based on the logical relationships among item responses (and
individuals)  utilizing  items  by  persons  dominance  matrices.  The  methodologies 
have  relationships  with  mathematical  information  theory  (Krus  &  Ceurvorst  1979) 
and  have  their  basic  psychometric  roots  in  earlier  work  by  Guttman  and  in  scalo- 
gram  analysis  (Airasian,  Madaus,  &  Woods  1975;  Bart  1976). 
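
The idea of a dominance matrix can be illustrated with a small sketch. It is only an assumption-laden example, not the procedure of any of the authors cited: the data are hypothetical, and the function name is ours. Entry [j][k] counts the persons who pass item j and fail item k. In a perfect Guttman scale, for every pair of items at least one of the two counts is zero, so the number of contradictions is zero.

```python
def item_dominance(data):
    """Item-by-item dominance counts from a persons-by-items 0/1 matrix.

    dom[j][k] is the number of persons who pass item j and fail item k.
    """
    n_items = len(data[0])
    dom = [[0] * n_items for _ in range(n_items)]
    for row in data:
        for j in range(n_items):
            for k in range(n_items):
                if row[j] == 1 and row[k] == 0:
                    dom[j][k] += 1
    return dom


def contradictions(dom):
    """Sum over item pairs of the smaller of the two opposing dominance counts."""
    n = len(dom)
    return sum(min(dom[j][k], dom[k][j]) for j in range(n) for k in range(j + 1, n))


# Hypothetical responses of four persons to three items forming a perfect scale.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
dominance = item_dominance(responses)
n_contradictions = contradictions(dominance)
```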

The  majority  of  the  research  in  order  analysis  has  been  in  the  field  of 
attitude  scaling  in  the  analysis  of  the  structure  of  item/person  matrices  (Bart 
1978;  Krus  1977,  1978;  Krus  &  Weiss  1976)  and  in  the  analysis  of  instructional 
hierarchies  (Airasian  &  Bart  1975;  Bart  &  Mertens  1979).  Cliff  (1979),  however, 
has  translated  the  approach  into  a  test  theory  approach  that  does  not  assume 
true  scores.  It  is  interesting  to  note  that  this  approach  also  permits  expres¬ 
sions  of  person  consistency  similar  to  the  person  fit  approaches  in  IRT.  Cliff 
then  generalizes  the  application  of  his  order  theory  methods  to  adaptive  testing 
(Cliff  1975,  1977;  Cliff,  Cudeck,  &  McCormick  1979;  Cudeck,  McCormick,  &  Cliff
1979),  whereas  Baker  and  Hubert  (1977)  propose  some  inference  procedures  and
hypothesis  testing  procedures  for  order  theory.  Initial  results  of  order  theory 
seem  promising,  but  additional  research  in  the  test  theory  area  is  necessary  to 
determine  the  degree  of  sample  specificity  of  this  approach  if  it  is  to  provide 
any  advantages  over  CTT.  Since  both  order  methods  and  IRT  methods  have  their 
ancestry  in  Guttman's  (1944)  scalogram  analysis,  some  thought  should  also  be 
given  to  the  relationships  between  the  two  methodologies. 

Miscellaneous  Models 


A  few  additional  new  developments  appeared  during  this  period  as  alterna¬ 
tives  to  CTT.  Wilcox  (1979e)  and  Morrison  and  Brockway  (1979)  discuss  applica¬ 
tions  of  the  beta-binomial  model  to  testing  problems.  Mellenbergh,  Kelderman, 
Stijlen,  and  Zondag  (1979)  develop  linear  models  for  the  analysis  and
construction  of  measuring  instruments  using  a  facet  (factorial)  design,  a  special
application  of  covariance  structure  analysis  to  the  construction  and  analysis  of
measuring  instruments.  Their  approach  is  an  alternative  to  generalizability  theory
(and  CTT)  and  permits  design  of  instruments  to  fit  a  hypothesized  facet-type 
structure.  McQuitty  (1976)  describes  an  item  analysis  procedure  based  on  con- 
figural  approaches,  while  Schulman  (1976,  1978;  Schulman  &  Haden  1975)  develops 
a  test  theory  for  ordinal  measurements  which  arrives  at  the  same  kinds  of  defi¬ 
nitions  of  reliability,  attenuation,  and  errors  of  measurement  as  does  CTT.  It 
differs  from  order  theory  approaches  in  that  it  is  basically  an  ordinal  theory 
based  on  total  scores  as  compared  to  the  order  theory  approaches  that  are  based 
on  logical  relationships  among  persons  and  items  at  the  item  level.  Finally, 
Whitely  and  Dawis  (1976)  present  and  apply  a  model  designed  to  psychometrically 
distinguish  the  concept  of  aptitude  (potential)  from  ability  (current  status). 
Their  data  suggest  that  the  predictability  from  later  stages  can  be  improved  by 
adding  the  gains  resulting  from  specific  interventions.  All  these  models
attempt  to  generalize  the  CTT  model  or  to  remedy  its  deficiencies.  All,
however,  will  require  additional  research  and  development  work  before  they  become
useful.


VALIDITY 

Content  and  Construct  Validity 

Two  seemingly  unrelated  phenomena,  the  test  fairness  controversy  (see  below)
and  the  CRT  movement,  have  heightened  interest  in  content  validity
(Schoenfeldt,  Schoenfeldt,  Acker,  &  Perlson  1976).  Some  believe  that  a  content-valid
employment  test  or  success  criterion  is  inherently  fair.  Much  of  the  literature
on  CRT  has  emphasized  the  content  validity  of  educational  achievement  tests  to 
the  exclusion  of  construct  and  criterion-related  validity.  The  heightened
interest  in  content  validity  has  led  to  a  controversy  about  when  or  whether  any
test  can  be  judged  solely  on  the  basis  of  content  validity.

Ebel  (1975)  argues  that  construct  validity  is  not  a  concern  if  the  behavior 
can  be  directly  observed  or  the  trait  can  be  operationally  defined.  In  opposi¬ 
tion  to  the  increased  emphasis  on  content  validity  in  educational  testing,  Mes- 
sick  (1975)  argues  that  construct  validity  is  as  important  for  educational  tests 
as  for  psychological  tests.  In  what  could  be  considered  a  response  to  Ebel 
(1975),  he  points  out  the  logical  difficulties  associated  with  operational
definitions.  Guion  (1977,  1978)  presents  his  reservations  about  the  increased
emphasis  on  content  validity  in  employment  testing,  including  his  concern  that
expert  judgments  about  content  validity  are  often  made  too  glibly.  He  goes  on 
to  list  six  conditions  which,  in  his  opinion,  a  test  must  meet  before  it  can  be 
judged  solely  on  the  basis  of  its  content  validity,  conditions  which  are  much
more  stringent  than  those  of  Ebel  (1975).  Guion  (1974)  discusses  the  merits  and
limitations  of  all  three  kinds  of  validity:  construct,  content,  and
criterion-related  validity.  Several  authors  have  considered  the  context  of  educational
testing  rather  than  the  content  validity  of  single  tests:  Carver  (1974,  1975), 
Hoepfner  (1974),  and  Levine  (1976)  lament  the  fact  that  published  educational 
tests  tap  so  narrow  a  set  of  educational  objectives. 

Multitrait-Multimethod  Matrices 


Structural  equation  models  have  been  applied  to  the  study  of  multitrait- 
multimethod  (MTMM)  correlation  matrices  in  the  search  for  statistical  procedures 
useful  in  studying  aspects  of  construct  validity.  Ray  and  Heeler  (1975)  compare 
restricted  maximum  likelihood  factor  analysis  and  multidimensional  scaling  as 
methods  of  analyzing  MTMM  matrices.  According  to  Kalleberg  and  Kluegel  (1975), 
the  structural  equations  approach  has  the  advantage  that  (1)  it  allows  estima¬ 
tion  of  correlations  between  trait  and  method  factors,  (2)  it  provides  estimates 
of  both  trait  and  method  factor  influences  on  each  measure,  and  (3)  it  forces 
researchers  to  specify  their  assumptions.  Mellenbergh  et  al.  (1979)  note  that
structural  equations  models  can  be  extended  to  the  study  of  any  test  facet  mod¬ 
el,  of  which  the  MTMM  model  is  one  example  and  Guilford's  (1967)  structure  of 
intellect  model  is  another. 

Avison  (1978)  and  Schmitt  (1978)  point  out  that  there  are  several
structural  equations  models,  not  just  one,  for  studying  MTMM  matrices.  Schmitt  (1978)
discusses  the  problem  of  choosing  between  possible  models  on  the  basis  of  their 
fit  to  the  data,  a  problem  that  is  only  partially  solved  at  present.  It  is  not 
clear  whether  the  choice  of  model  substantially  influences  the  conclusions 
reached. 

Methods  of  investigating  MTMM  matrices  that  do  not  rest  on  structural  equa¬ 
tions  models  have  been  described  by  Golding  (1977),  Golding  and  Seidman  (1974), 
Hubert  and  Baker  (1978,  1979),  Jackson  (1975,  1977),  Levin  (1974),  and  Lomax  and 
Algina  (1979).  After  reviewing  alternatives  to  the  structural  equations  ap¬ 
proach,  Schmitt,  Coyle,  and  Saari  (1977)  conclude  that  the  structural  equations 
models  provide  the  most  detailed  information  about  individual  traits  and
methods.


The  structural  equations  model  for  MTMM  data  contains  an  implicit  defini¬ 
tion  of  method  variance,  a  term  which  Campbell  and  Fiske  (1959)  left  only  vague¬ 
ly  specified  (Golding  1977).  That  is,  method  variance  is  variance  attributable 
to  a  dimension  of  individual  differences  that  (1)  is  uniquely  associated  in  the 
factor  pattern  matrix  with  measures  employing  one  particular  method  of  measure¬ 
ment,  (2)  contributes  to  the  variance  of  any  measure  assessed  by  that  method, 
and  (3)  combines  in  an  additive  fashion  with  other  sources  of  variance.  Other 
definitions  are  possible.  In  Tucker's  (1966)  three-mode  factor  model  for  MTMM 
matrices,  for  instance,  traits  and  methods  combine  in  a  multiplicative
interaction  rather  than  in  an  additive  fashion.  Because  the  structural  equations
approaches  are  becoming  widely  accepted,  the  definitions  of  trait  and  method
variance  implicit  in  those  models  deserve  closer  scrutiny  than  they  have  received  in
the  past.  Tesser  and  Krauss  (1976)  remind  us  that  an  MTMM  matrix  is  not  the  only  way  to
investigate  construct  validity.
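
The additive definition of method variance described above can be written compactly. The notation below is ours and is only a sketch under stated assumptions: x_{tm} is the standardized measure of trait t by method m, T_t and M_m are unit-variance trait and method factors with correlation \phi_{tm}, and e_{tm} is an error term with variance \theta_{tm} that is uncorrelated with both factors.

```latex
x_{tm} = \lambda_{tm} T_t + \gamma_{tm} M_m + e_{tm},
\qquad
\operatorname{Var}(x_{tm}) = \lambda_{tm}^{2} + \gamma_{tm}^{2}
  + 2\,\lambda_{tm}\,\gamma_{tm}\,\phi_{tm} + \theta_{tm}
```

When the trait and method factors are uncorrelated (\phi_{tm} = 0), the variance of a measure separates cleanly into a trait component \lambda_{tm}^{2}, a method component \gamma_{tm}^{2}, and error. No such separation is available when traits and methods combine multiplicatively.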


Predictive  Validity 

How  large  a  sample  size  is  needed  to  study  a  test's  predictive  validity?
This  is  the  question  addressed  by  Cascio,  Valenzi,  and  Silbey  (1978)  and
Schmidt  and  Hunter  (1977).  Schmidt  and  Hunter  argue  that  the  sample  sizes  need¬ 
ed  for  predictive  validity  studies  are  often  much  larger  than  commonly  recom¬ 
mended.  Because  the  observed  correlation  is  typically  reduced  by  such  influ¬ 
ences  as  restriction  in  range  and  criterion  unreliability,  large  sample  sizes 
are  needed  to  ensure  adequate  power  in  statistical  tests  of  predictive  validity
coefficients. 
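
The effect of these influences on required sample size can be illustrated numerically. The sketch below is not the procedure used by the authors cited. It assumes a one-tailed test of a zero correlation based on the Fisher z approximation, applies criterion unreliability before range restriction as a simplification, and uses hypothetical values throughout (a true validity of .50, a criterion reliability of .60, and a selected-group standard deviation that is 60 percent of the applicant standard deviation).

```python
import math
from statistics import NormalDist


def attenuate(rho, criterion_reliability, u):
    """Observed validity after criterion unreliability and range restriction.

    u is the ratio of the restricted to the unrestricted predictor SD (0 < u <= 1).
    """
    r = rho * math.sqrt(criterion_reliability)
    return r * u / math.sqrt(1.0 - r * r + r * r * u * u)


def n_for_power(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation r (one-tailed, Fisher z)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)
    z_beta = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)


true_validity = 0.50  # hypothetical
observed_validity = attenuate(true_validity, criterion_reliability=0.60, u=0.60)

n_unattenuated = n_for_power(true_validity)
n_attenuated = n_for_power(observed_validity)
```

Under these assumed values the attenuated coefficient is roughly half the true validity, and the sample size needed for the same power is several times larger.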

Schmidt  and  Hunter  (1977)  and  Schmidt,  Hunter,  Pearlman,  and  Shane  (1979) 
argue  against  the  dominant  belief  that  the  predictive  validity  of  selection 
tests  is  highly  situation  specific.  Prior  research  has  revealed  considerable 
variation  in  the  observed  validity  coefficients  for  the  same  test  in  several  job 
settings.  Schmidt  and  his  coworkers  argue  that  most  of  this  variation  is  not 
due  to  fluctuations  in  the  true  validity  of  the  test.  Rather,  much  of  the  vari¬ 
ation  is  due  to  artifactual  sources,  including  variation  from  one  job  setting  to 
the  next  in  (1)  criterion  reliability,  (2)  test  reliability,  (3)  range  restric¬ 
tion,  and  (4)  criterion  contamination.  Because  of  the  small  sample  sizes  used 
in  many  validity  studies,  sampling  error  can  also  account  for  some  of  the  varia¬ 
tion.  Schmidt  and  Hunter  (1977)  propose  a  Bayesian  method  of  combining  validity 
coefficients  across  studies  on  the  same  job  family  to  arrive  at  pooled  estimates 
of  validity. 
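
How much of the variation in observed validity coefficients could be due to sampling error alone can be gauged with a simple calculation. The sketch below is a simplified illustration, not the Bayesian procedure of Schmidt and Hunter (1977): it ignores the other artifacts listed above, it uses the common large-sample approximation (1 - r^2)^2 / (N - 1) for the sampling variance of a correlation, and the coefficients and sample sizes are hypothetical.

```python
def sampling_error_summary(rs, ns):
    """Compare observed variance of validity coefficients with the variance
    expected from sampling error alone.

    Returns (weighted mean r, observed variance, expected sampling variance).
    """
    total_n = sum(ns)
    mean_r = sum(r * n for r, n in zip(rs, ns)) / total_n
    # Sample-size-weighted variance of the observed coefficients.
    observed_var = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Approximate sampling variance at the average sample size.
    mean_n = total_n / len(ns)
    expected_var = (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)
    return mean_r, observed_var, expected_var


# Hypothetical validity coefficients for one test in five job settings.
validities = [0.10, 0.35, 0.22, 0.05, 0.40]
sample_sizes = [60, 70, 65, 55, 75]

mean_r, observed_var, expected_var = sampling_error_summary(validities, sample_sizes)
share_due_to_sampling = expected_var / observed_var
```

With samples of this size, sampling error alone accounts for a large share of the spread in these hypothetical coefficients, even though they range from .05 to .40.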

As  Schmidt  and  his  coworkers  argue,  a  portion  of  the  variation  in  a  test's 
validity  coefficient  from  study  to  study  is  due  to  artifactual  sources  and  sam¬ 
pling  error.  How  much  is  due  to  those  sources?  Schmidt  and  his  coworkers  pile 
one  untested  assumption  upon  another  to  arrive  at  their  estimates  and  to  develop 
their  Bayesian  approach.  The  Bayesian  alternative  is  only  as  good  as  the  un¬ 
tested  assumptions;  and  it  presumes  a  satisfactory  method  of  classifying  tests 
into  job  families,  something  which  does  not  now  exist.  Callender  and  Osburn 
(1979)  and  Callender,  Osburn,  and  Greener  (1979)  propose  an  alternative  model 
that  leads  to  smaller  estimates  of  the  artifactual  variance  and  to  an  alterna¬ 
tive  Bayesian  approach.  Rock,  Werts,  Linn,  and  Joreskog  (1977)  provide  some 
possible  methodological  assistance  for  the  criterion-related  validity  problem  in 
their  structural  equations  model  that  partitions  criterion  variance  into  (1) 
measurement  error,  (2)  true  score  variance  accounted  for  by  the  predictor,  and 
(3)  true  score  variance  unaccounted  for  by  the  predictor. 

Although  Schmidt  and  Hunter's  (1977)  Bayesian  model  may  not  be  the  answer, 
their  work  raises  an  important  issue.  Given  the  often  unavoidable  limitations 
(particularly  limitations  of  sample  size)  in  job  specific  validity  studies, 
would  pooled  estimates  sometimes  be  better?  If  so,  under  what  conditions,  and 
how  should  the  several  job  specific  coefficients  be  pooled?  A  workable  taxonomy 
of  job  families  would  need  to  be  developed  before  job  pooling  could  become  ac¬ 
cepted  (Pearlman  1980). 

Schulman  (1976)  presents  a  predictive  validity  model  for  ordinal
measurements.  A  surprising  number  of  authors  have  examined  the  validity  of  self-report
ability  measures  (DeNisi  &  Shaw  1977;  Farrel  1979;  Levine  &  Flory  1977;  Norris 
1976;  Norris  &  Chapman  1976;  Pohlmann  &  Beggs  1974).  Hogan,  De  Soto,  and  Solano
(1977),  Mischel  (1977),  and  Wade  and  Baker  (1977)  ponder  the  value  of  personali¬ 
ty  tests.  The  economic  impact  of  valid  selection  was  examined  by  Schmidt, 
Hunter,  McKenzie,  and  Muldrow  (1979). 

Moderator  and  Suppressor  Effects 

Lissitz  and  Schoenfeldt  (1974),  Gross,  Steckler-Faggen,  and  McCarthy 
(1974),  and  Novick  and  Jackson  (1974)  consider  the  problem  of  using  subgroup 
information  as  a  moderator  variable  in  prediction  equations.  Drosler  (1978) 
presents  a  scheme  for  increasing  the  temporal  range  of  psychometric  predictions. 
Conger  (1974)  and  Velicer  (1978)  attempt  to  improve  the  definition  of  suppressor 
variables  and  methods  for  dealing  with  them,  whereas  McFatter  (1979)  illustrates 
a  structural  equations  approach  for  interpreting  suppressor  and  enhancer  vari¬ 
ables.  Brown  (1979),  Greener  and  Osburn  (1979),  Gullickson  and  Hopkins  (1976), 
and  Roe  (1979)  consider  the  accuracy  of  corrections  for  restriction  in  range. 
Sands  and  Alf  (1978)  present  a  correction  for  restriction  in  range  that  does  not 
require  that  the  user  know  the  variance  of  the  predictor  in  the  applicant  popu¬ 
lation,  although  it  does  require  knowledge  of  the  selection  ratio.  Osburn  and 
Greener  (1978)  discuss  methods  of  sampling  selected  applicants  for  inclusion  in 
a  predictive  validity  study  when  criterion  information  is  too  difficult  or  ex¬ 
pensive  to  collect  from  all  selected  applicants. 
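
For orientation, the familiar correction for direct restriction in range on the predictor can be sketched as follows. This is the standard textbook formula, which does require the predictor standard deviation in the applicant population. It is not the Sands and Alf (1978) procedure, and the numerical values are hypothetical.

```python
import math


def correct_range_restriction(r_restricted, sd_applicants, sd_selected):
    """Estimate the applicant-population correlation from the correlation
    observed in a group directly selected on the predictor."""
    k = sd_applicants / sd_selected
    return r_restricted * k / math.sqrt(
        1.0 - r_restricted ** 2 + (r_restricted * k) ** 2
    )


# Hypothetical study: r = .30 among those selected, whose predictor SD is 6,
# compared with an SD of 10 among all applicants.
r_corrected = correct_range_restriction(0.30, sd_applicants=10.0, sd_selected=6.0)
```

The accuracy of the corrected value depends on the assumptions of linearity and homoscedasticity, which is why the accuracy studies cited above are of practical interest.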

Educational  Applications 

Cronbach  and  Snow's  (1977)  book  on  aptitude  and  instructional  methods  is  a 
landmark  review  of  the  research  on  the  interaction  between  instructional  methods 
and  student  aptitudes.  The  authors  conclude  that  the  literature  contains  very 
few  examples  of  consistently  replicated  interactions  between  measured  aptitudes 
and  instructional  methods.  Hunt  (1975)  suggests  that  new  types  of  tests  will 
need  to  be  developed — tests  that  are  specifically  designed  to  assess  those  char¬ 
acteristics  that  interact  with  educational  methods.  Corno  (1979),  Tobias 
(1976),  and  Winne  (1977)  report  finding  isolated  aptitude-treatment  interactions 
of  various  types,  all  of  which  require  further  replication.  It  is  clear  from 
Cronbach  and  Snow's  (1977)  review  that  research  will  only  slowly  reveal  how  per¬ 
son  characteristics  and  instructional  methods  interact.  An  understanding  of 
such  interactions  would  greatly  enhance  the  ability  to  adapt  instruction  to  the 
learner's  needs. 

Airasian  and  Bart  (1975),  Dayton  and  MacReady  (1976),  Davison  (1980),  and 
Davison  and  Thoma  (1980)  describe  methods  for  studying  the  internal  structure  of 
tests  constructed  around  hypothesized  item  hierarchies.  Davison  (1977,  1979) 
has  discussed  methods  of  studying  the  interrelationships  between  subscales,  each 
of  which  corresponds  to  an  ordered  stage  in  a  developmental  sequence.  Applica¬ 
tions  of  these  techniques  can  be  found  in  Davison,  King,  Kitchener,  and  Parker 
(1980),  Davison  and  Robbins  (1978),  Davison,  Robbins,  and  Swanson  (1978),  and 
Jepsen  and  Grove  (1980). 

Test  Fairness


On  many  educational  and  occupational  selection  tests,  some  American
minorities  (Blacks,  Hispanics,  Native  Americans,  and  some  Asian  Americans)  form
populations  with  lower  mean  scores  than  the  White  majority.  Not  all  ethnic
minorities,  however,  have  consistently  lower  mean  scores;  Chinese  and  Japanese
Americans  are  notable  exceptions.  For  those  minorities  with  lower  mean  scores,  the  result  can  be  a
lower  rate  of  selection  for  jobs  or  educational  admission  if  selection  is  based 
heavily  on  tests.  However,  there  is  little  information  available  about  how 
heavily  test  information  influences  selection  decisions.  Without  such  informa¬ 
tion,  it  is  impossible  to  say  how  much  of  a  barrier  tests  have  actually  posed  to 
minorities  seeking  selection  to  jobs  or  admission  to  educational  institutions. 

Three  books  on  test  fairness  have  appeared  during  the  period  of  this
review.  Neither  Oakland's  (1977)  nor  Samuda's  (1975)  book  scrutinizes
alternatives  to  standardized  tests  with  the  same  critical  eye  with  which  it  evaluates
traditional  tests.  Nor  does  either  present  the  case  in  favor  of  standardized
testing  with  the  same  thoroughness  with  which  it  presents  the  case  against.
Jensen's  (1980)  defense  of  test  fairness  is  a  more  up-to-date  and  complete
treatment.  His  thoroughness  is  attested  to  by  the  fact  that  the  popular  press  seems
to  draw  from  his  work  even  to  criticize  testing  (for  instance,  compare  Jensen 
1980,  p.  5,  with  Sewall,  Carey,  Simons,  and  Lord  1980,  p.  97). 

Fairness  of  tests  to  women  has  also  been  of  concern.  The  context  of  this 
discussion  is  quite  different,  however,  because  the  mean  score  of  females  on
many  tests,  particularly  verbal  aptitude  tests,  is  higher  than  that  of  males.
Where  women  do  have  lower  average  scores,  the  differences  are  often  not  as
marked  as  for  racial  minorities.  Maccoby  and  Jacklin's  (1974)  work  on  sex  dif¬ 
ferences  aids  in  understanding  the  discussion  of  test  fairness  to  women. 

Definitions  of  Test  Fairness 


Various  authors  have  proposed  definitions  of  bias  in  selection,  bias  in  a 
test,  and  bias  in  a  test  item. 

Bias  in  selection.  There  are  at  least  five  major  definitions  of  bias  in 
selection.  In  general,  no  selection  strategy  can  satisfy  all  of  the  fairness 
definitions.  According  to  Cleary  (1968,  p.  115),  a  test  is  biased  against  mem¬ 
bers  of  a  subgroup  "if  in  the  prediction  of  a  criterion  for  which  the  test  was 
designed,  consistent  nonzero  errors  of  prediction  are  made  for  members  of  the 
subgroup."  Einhorn  and  Bass  (1971)  define  selection  as  fair  if  the  least
qualified  persons  who  would  be  accepted  from  each  subgroup  have  an  equal  chance  of
succeeding.  Several  authors  define  fairness  in  terms  of  ratios.  Selection  can 
be  defined  as  fair  if  the  ratio  of  the  number  selected  to  the  number  qualified 
is  the  same  for  all  subgroups  (Thorndike  1971),  if  the  ratio  of  the  number  se¬ 
lected  and  qualified  to  the  number  qualified  is  the  same  for  all  groups  (Cole 
1973),  or  if  the  ratio  of  the  number  selected  and  qualified  to  the  number
selected  is  the  same  for  all  groups  (Linn  1973).  Definitions  of  fairness  are
critiqued  in  Bernal  (1975),  Cleary,  Humphreys,  Kendrick,  and  Wesman  (1975),  Cronbach
(1976),  Darlington  (1976,  1978),  Flaugher  (1978),  Hunter  and  Schmidt  (1974,
1976,  1978a),  McNemar  (1975),  Myers  (1975),  Novick  and  Ellis  (1977),  Petersen
and  Novick  (1976),  Pine  and  Weiss  (1976),  and  Sawyer,  Cole,  and  Cole  (1976).
Petersen  and  Novick  (1976)  point  out  serious  logical  inconsistencies  in  the
three  ratio  models.  Hunter,  Schmidt,  and  Rauschenberger  (1977)  note  that  the
Cleary  (1968)  model  seems  to  be  the  only  one  adopted  by  the  U.S.  Equal  Employ¬ 
ment  Opportunity  Commission  (1970)  guidelines  on  employee  selection,  and  it  has 
been  upheld  in  a  U.S.  District  Court  decision  (Cortez  v.  Rosen,  Northern  Dis¬ 
trict  of  California,  March  11,  1975).  If  the  courts  tightly  limit  the  use  of 
race  in  selection,  as  Novick  and  Ellis  (1977)  suggest,  then  use  of  the  above 
fairness  models  would  be  correspondingly  limited.  Recent  U.S.  Supreme  Court 
decisions,  the  so-called  Bakke  and  Weber  decisions,  however,  suggest  that  racial 
information  can  be  used  even  by  institutions  that  have  no  prior  history  of  dis¬ 
crimination. 
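
The three ratio definitions given earlier in this section (Thorndike 1971; Cole 1973; Linn 1973) can be compared on a small example. The sketch below uses hypothetical counts and a function name of our own. It shows a selection outcome that equates the Thorndike ratio across two groups while leaving the Cole and Linn ratios unequal, which illustrates why a single selection strategy generally cannot satisfy all of the definitions at once.

```python
def fairness_ratios(selected_qualified, selected_unqualified, rejected_qualified):
    """Ratios used in three definitions of fair selection for one subgroup."""
    selected = selected_qualified + selected_unqualified
    qualified = selected_qualified + rejected_qualified
    return {
        # Thorndike (1971): number selected / number qualified
        "thorndike": selected / qualified,
        # Cole (1973): number selected and qualified / number qualified
        "cole": selected_qualified / qualified,
        # Linn (1973): number selected and qualified / number selected
        "linn": selected_qualified / selected,
    }


# Hypothetical selection outcomes for two subgroups.
majority = fairness_ratios(selected_qualified=40, selected_unqualified=10,
                           rejected_qualified=20)
minority = fairness_ratios(selected_qualified=12, selected_unqualified=8,
                           rejected_qualified=12)
```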

A  decision  theoretic  approach  in  which  institutions  assign  utilities  to 
selection  outcomes,  utilities  that  may  vary  as  a  function  of  the  race  or  sex  of 
the  applicant,  has  been  endorsed  by  Darlington  (1976,  1978),  Gross  and  Su
(1975),  Linn  (1976a),  Petersen  and  Novick  (1976),  and  Sawyer,  Cole,  and  Cole 
(1976).  As  a  criterion  for  selection  fairness,  decision  theory  is  hopelessly 
vague.  As  a  consequence,  it  can  be  used  to  discriminate  against  any  desired 
group  by  appropriately  assigning  utilities  to  outcomes.  Petersen  and  Novick 
(1976),  however,  show  that  a  decision  theoretic  framework  can  profitably  be  used 
to  evaluate  various  proposed  fairness  models.  Tools  used  to  equalize  the  pro¬ 
portion  of  majority  and  minority  members  selected  now  include  quotas  (Rybak 
1980),  bonus  points  for  minority  or  disadvantaged  applicants  (Roark  1978),  and 
separate  standardization  of  a  test  by  subgroups  so  that  the  test  has  the  same 
mean  in  minority  and  majority  subgroups  (Mercer  &  Lewis  1978). 

Not  all  definitions  of  bias  describe  bias  in  selection.  Jackson  (1975) 
presumes  that  Blacks  and  Whites  are  equal  in  ability  and,  therefore,  that  any 
test  is  biased  if  the  mean  scores  for  Blacks  and  Whites  are  different.  Faggen- 
Steckler,  McCarthy,  and  Tittle  (1974)  propose  a  measure  of  item  content  bias. 
Removing  bias  in  test  content,  however,  need  not  affect  differences  between  mean 
test  scores  of  groups.  Echternacht  (1974),  Ironson  and  Subkoviak  (1979),  and 
Scheuneman  (1979)  discuss  performance-based  measures  of  item  bias.  By  eliminat¬ 
ing  items  that  contain  bias  as  assessed  by  one  of  these  performance-based  mea¬ 
sures,  the  most  biased  items  in  the  test  may  be  eliminated.  After  pruning  such 
items,  however,  the  test  itself  will  be  unbiased  only  if  the  average  item  in  the 
original  item  pool  was  unbiased  (Green  1978).  Flaugher  and  Schrader  (1978) 
found  that  pruning  biased  items  did  not  substantially  alter  the  mean  difference 
between  minority  and  majority  students  and,  hence,  that  such  methods  of  pruning 
items  would  likely  not  materially  affect  the  adverse  impact  of  selection  deci¬ 
sions. 

Fairness  to  Minorities 


Empirical  studies  of  tests  administered  to  racial  or  ethnic  minorities  have 
focused  heavily  on  Blacks  and  to  a  lesser  extent  on  Mexican  Americans.  There 
was  much  less  research  on  Native  Americans,  Asian  Americans,  and  non-Mexican
Hispanic  populations.  The  most  thoroughly  researched  setting  was  the  college 
admissions  situation  in  which  the  predictors  are  high-school  grade-point  average 
(GPA)  and  scholastic  admissions  tests  and  in  which  the  criterion  is  college  GPA. 
Although  there  were  numerous  studies  of  employment  selection,  there  was  little 
consistency  in  the  predictors  and  criteria  employed.  There  are  some  general 
trends  in  the  employment  studies,  but  no  conclusions  can  be  drawn  about  specific 
jobs,  tests,  or  criterion  variables.  As  has  been  noted  time  and  time  again  by
authors  in  the  area,  the  research  strategies  assume  an  unbiased  criterion,  when 
in  practice  there  is  no  agreed  upon  standard  by  which  to  judge  the  criterion. 
Conclusions  to  be  drawn  from  the  research  described  below  depend  upon  whether
the  criteria  employed  are  believed  to  be  biased  or  not  biased. 

Differential  and  single-group  validity.  It  is  commonly  stated  that
traditional  tests  are  less  valid  for  minority  applicants  than  for  nonminorities.
Such  statements  have  given  rise  to  the  single-group  and  differential  validity
issues.  Differential  validity  is  said  to  exist  when  predictive  validity
coefficients  are  unequal  in  the  minority  and  majority  subgroups  (ρ₁  <  ρ₂).  When  the
predictive  validity  coefficient  is  greater  than  zero  for  only  one  subgroup
(0  =  ρ₁  <  ρ₂),  then  single-group  validity  is  said  to  exist.  In  her  seminal  work,
Boehm  (1972)  defined  single  group  and  differential  validity  differently,  but  her 
definitions  contain  logical  problems  (Bartlett,  Bobko,  &  Pine  1977;  Hunter  & 
Schmidt  1978b). 

The  research  on  differential  and  single-group  validity  has  been  plagued  by 
methodological  problems  including  nonindependence  of  observations,  statistical
tests  with  low  power,  and  differential  restriction  of  range  in  majority  and  mi¬ 
nority  populations.  Bartlett,  Bobko,  and  Pine  (1977),  Bobko  and  Bartlett 
(1978),  Boehm  (1977,  1978),  Hunter  and  Schmidt  (1978b),  Hunter,  Schmidt,  and 
Hunter  (1979),  Katzell  and  Dyer  (1977,  1978),  and  O'Connor,  Wexley,  and  Alexan¬ 
der  (1975)  have  discussed  these  methodological  issues.  Conclusions  about  the 
differential  validity  hypothesis  have  varied,  depending  on  whether  or  not  the 
reviewer  included  data  on  tests  whose  validity  did  not  differ  significantly  from 
zero  in  any  sample  and  which,  hence,  would  not  be  used  as  a  selection  device. 
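
The low power of such comparisons is easy to illustrate. The sketch below applies the standard Fisher z test for the difference between two independent correlations. It is offered only as an illustration and is not attributed to any of the authors cited. The coefficients and sample sizes are hypothetical, with the small minority sample typical of many validity studies.

```python
import math
from statistics import NormalDist


def compare_validities(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent correlations.

    Returns the z statistic and the two-tailed p value.
    """
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Hypothetical validities: .20 in a minority sample of 40,
# .35 in a majority sample of 200.
z_stat, p_value = compare_validities(0.20, 40, 0.35, 200)
```

With these assumed values a difference of .15 between the coefficients falls well short of conventional significance levels, so a nonsignificant result says little about whether the population validities differ.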

In  the  area  of  employment  selection,  Bobko  and  Bartlett  (1978),  Boehm 
(1977,  1978),  Gael,  Grant,  and  Ritchie  (1975a,  1975b),  Hunter  and  Schmidt 
(1978b),  Hunter,  Schmidt,  and  Hunter  (1979),  Linn  (1978b),  O'Connor,  Wexley,  and 
Alexander  (1975),  and  Reeb  (1976)  present  and  interpret  the  evidence  pertaining 
to  single  group  and  differential  validity.  Particularly  in  later  studies  (Boehm 
1977;  Hunter,  Schmidt,  &  Hunter  1979;  Katzell  &  Dyer  1977;  Linn  1978b;  O'Connor 
et  al.  1975),  authors  conclude  that  the  evidence  against  the  single-group
validity  hypothesis  is  overwhelming.  Authors  still  differ  on  whether  or  not  examples
of  differential  validity  occur  more  frequently  than  can  be  attributed  to  arti¬ 
facts  and  sampling  error.  Most  reviewers,  however,  conclude  that  examples  of 
differential  validity  are  rare  and  that  when  differences  in  validity  do  exist, 
they  are  usually  small.  Boehm  (1977)  found  that  the  most  methodologically  sound 
studies  reported  the  fewest  examples  of  differential  and  single  group  validity. 
Both  Bobko  and  Bartlett  (1978)  and  Linn  (1978b)  conclude  that  the  single-group  and
differential  validity  issues  are  secondary  to  the  question  of  whether  or
not  the  performance  of  minorities  is  systematically  underpredicted  by  tests.  We 
strongly  agree. 

In  the  educational  literature,  examples  of  large  differences  in  minority 
and  majority  validities  are  just  as  rare  as  in  the  employment  literature.
Wright  and  Bean  (1974)  found  that  the  college  GPAs  of  high  socioeconomic  status
(SES)  students  were  somewhat  better  predicted  than  those  of  low  SES  students. 
Pfeifer  (1976)  found  little  difference  in  the  predictability  of  Whites  and 
Blacks.  Breland  (1978)  and  Wilson  (1978)  concluded  that  the  traditional
predictors  of  college  GPA  are  generally  valid  predictors  for  both  majority  and
minority  students.  Flaugher  (1978)  argues  that  if  educational  examples  of  single
group  and  differential  validity  are  so  difficult  to  find,  then  they  are  probably 
not  of  much  practical  import. 

Comparisons  of  regression  lines.  Although  some  researchers  have  been 
studying  differential  validity,  others  have  been  studying  test  fairness  as  de¬ 
fined  by  Cleary  (1968).  That  is,  they  have  been  examining  regression  lines  to 
determine  if  use  of  a  common  regression  line  for  both  majority  and  minority  sub¬ 
groups  would  result  in  over-  or  under-prediction  of  success  for  either  group. 
Goldman  and  his  coworkers  (Goldman  &  Hewitt  1975,  1976;  Goldman  &  Richards  1974;
Goldman  &  Widawski  1976)  generally  found  no  evidence  for  bias  in  the  prediction
of  college  grades  among  Blacks,  Whites,  Chicanos,  and  Orientals  in  the  Universi¬ 
ty  of  California  system.  In  one  exception  (Goldman  &  Hewitt  1975),  the  authors 
found  trivial  differences  between  regression  lines  for  Anglo  and  Mexican  Ameri¬ 
can  samples.  In  another  series  of  studies  at  California  colleges,  Warren  (1976) 
found  only  two  instances  in  which  regression  lines  were  significantly  different 
for  Anglo  and  Mexican  Americans.  In  one  case,  selection  was  biased  in  favor  of 
Mexican  Americans;  in  one  case  it  was  biased  against  them;  and  in  both  cases  the 
bias  was  small.  Cleary  et  al .  (1975)  reviewed  several  studies  comparing  regres¬ 
sion  lines  for  Blacks  and  Whites,  concluding  that  when  only  standard  courses  are 
included  in  the  college  GPAs,  differences  in  the  regression  lines  are  small  and 
favor  Blacks  more  often  than  Whites.  Silverman,  Barton,  and  Lyon  (1976)  found 
bias  in  favor  of  Blacks.  When  differences  exist,  it  is  usually  because  the  re¬ 
gression  lines  for  the  two  groups  have  different  intercepts  rather  than  because 
they  have  different  slopes. 
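
The logic of these regression comparisons can be sketched in a few lines. The example below is a minimal illustration with hypothetical data, not a reanalysis of any study cited. It fits one least-squares line to the pooled data and reports the mean error of prediction in each group. Under Cleary's (1968) definition, a consistently positive mean residual indicates underprediction for that group and a negative one indicates overprediction. The two hypothetical groups have equal slopes and different intercepts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residual_by_group(xs, ys, groups):
    """Mean of (actual - predicted) per group under a common regression line."""
    slope, intercept = fit_line(xs, ys)
    sums, counts = {}, {}
    for x, y, g in zip(xs, ys, groups):
        sums[g] = sums.get(g, 0.0) + (y - (intercept + slope * x))
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


# Hypothetical test scores, criterion scores, and group labels.
scores = [1, 2, 3, 4, 1, 2, 3, 4]
criterion = [1.5, 2.5, 3.5, 4.5, 1.0, 2.0, 3.0, 4.0]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

residuals = mean_residual_by_group(scores, criterion, labels)
```

Here the common line underpredicts the criterion for group A and overpredicts it for group B by the same amount, the pattern produced by an intercept difference.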

What  can  be  concluded  from  these  studies  based  on  Cleary's  (1968)  defini¬ 
tion  of  fairness?  In  the  published  literature,  the  evidence  suggests  that  tests 
do  not  consistently  underpredict  the  performance  of  minorities  on  traditional 
success  criteria  when  a  common  regression  line  is  used  for  both  the  majority  and 
minority  groups.  This  means  that  the  tests  are  no  more  or  less  biased  than  the 
criteria  they  are  designed  to  predict.  The  evidence  could  be  said  to  overwhelm¬ 
ingly  support  the  fairness  of  tests  were  it  not  for  lingering  doubts  about  the 
fairness  of  traditional  success  criteria.  There  is  a  pressing  need  to  define 
what  constitutes  a  fair  criterion  and  then  to  evaluate  traditional  success  cri¬ 
teria  against  that  definition.  Without  further  work  on  the  criterion  problem,  a 
more  definitive  answer  to  the  question  of  test  fairness  is  impossible  within  the 
Cleary  framework. 
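
The Cleary check described above can be illustrated with a minimal sketch: fit one common regression line to the pooled data, then look at the mean prediction error within each group. The function names and the toy data below are hypothetical and are not taken from any of the studies reviewed; a positive mean residual indicates that the common line underpredicts that group's criterion performance.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def mean_residual_by_group(scores, criterion, groups):
    """Mean (actual - predicted) per group under one common regression line."""
    intercept, slope = fit_line(scores, criterion)
    residuals = {}
    for x, y, g in zip(scores, criterion, groups):
        residuals.setdefault(g, []).append(y - (intercept + slope * x))
    return {g: mean(r) for g, r in residuals.items()}

# Hypothetical toy data: both groups share a slope of 1 but differ in intercept.
scores = [1, 2, 3, 4, 1, 2, 3, 4]
criterion = [2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0]
groups = ["A"] * 4 + ["B"] * 4

result = mean_residual_by_group(scores, criterion, groups)
print(result)  # group A is underpredicted, group B is overpredicted
```

In this toy case the groups differ only in intercept, which is the pattern the studies above report most often when differences are found. A real analysis would also test whether the observed differences are statistically significant.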

Adverse  impact.  A  somewhat  different  approach  to  the  study  of  test  fair¬ 
ness  was  adopted  by  Hunter,  Schmidt,  and  Rauschenberger  (1977).  They  compared 
the  adverse  impact  and  validity  of  selections  based  on  four  fairness  strategies. 
Cleary's  (1968)  model  was  the  most  valid  and  a  quota  model  had  the  least  adverse 
impact.  The  most  valid  models  also  had  the  greatest  adverse  impact  and  vice
versa.  Breland  and  Ironson  (1976)  used  admissions  data  from  the  University  of 
Washington  Law  School  to  compare  the  Cleary  (1968),  Cole  (1973),  and  Thorndike 
(1971)  definitions  of  fairness  in  terms  of  the  number  of  minority  applicants  who 
would  be  selected,  using  selection  rules  satisfying  each.  Differences  between
fairness  models  were  small.  Because  none  of  the  fairness  models  would  have  se¬ 
lected  as  many  minority  applicants  as  did  the  admissions  committee,  Breland  and 
Ironson  (1976)  argue  against  the  adoption  of  any  psychometric  fairness  model, 
and  for  the  values  embodied  in  selection  committee  decisions.  This  is  a  curious 
argument  in  light  of  the  historical  fact  that  the  search  for  a  fairness  definition
began  as  a  reaction  to  seemingly  unfair  personnel  and  admissions  decisions,
particularly  in  the  South,  and  that  the  object  of  the  search  was  to  find  a  fair¬ 
ness  standard  by  which  admissions  and  selection  decisions  could  be  judged. 
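
As a rough illustration of how adverse impact might be quantified for a given selection rule, the sketch below selects the top scorers and compares selection rates across groups. It assumes adverse impact is summarized as the ratio of the minority selection rate to the majority selection rate, which is one common operationalization and not necessarily the one used in the studies above. All names and data are hypothetical.

```python
def select_top(scores, k):
    """Indices of the k highest scores (ties broken by original order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return set(order[:k])

def selection_rates(selected, groups):
    """Proportion of each group's applicants who were selected."""
    totals, chosen = {}, {}
    for i, g in enumerate(groups):
        totals[g] = totals.get(g, 0) + 1
        chosen[g] = chosen.get(g, 0) + int(i in selected)
    return {g: chosen[g] / totals[g] for g in totals}

def impact_ratio(rates, minority, majority):
    """Minority rate divided by majority rate (assumes a nonzero majority rate)."""
    return rates[minority] / rates[majority]

# Hypothetical applicant pool of eight, four per group; select the top four.
scores = [90, 85, 80, 75, 70, 65, 60, 55]
groups = ["majority", "majority", "minority", "majority",
          "minority", "majority", "minority", "minority"]

rates = selection_rates(select_top(scores, 4), groups)
ratio = impact_ratio(rates, "minority", "majority")
print(rates, ratio)
```

Swapping `select_top` for a different rule (for example, one that selects a fixed number within each group) and recomputing the ratio gives a simple way to compare the adverse impact of alternative selection strategies on the same pool.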

Bias  in  test  content.  It  is  commonly  suspected  that  test  content  is  biased.
Schmeiser  and  Ferguson  (1978)  found  that  mean  scores  of  Whites  were  higher
than  mean  scores  of  Blacks,  even  when  the  test  material  was  written  from  a  Black
perspective.  In  the  Schmeiser  and  Ferguson  (1978)  study,  the  cultural  information
needed  to  answer  the  items  was  provided  in  reading  passages  and  did  not  need  to 
be  recalled  by  the  examinee.  Successfully  answering  the  items  depended  only  on 
correct  reasoning  with  the  information  given.  It  is,  however,  possible  to  re¬ 
duce  or  even  to  reverse  the  mean  difference  between  majority  and  minority  groups 
using  items  for  which  successful  completion  requires  recall  of  information  more 
commonly  available  in  the  minority  subculture  (Medley  &  Quirk  1974;  Williams 
1975).  No  one  appears  to  have  investigated  the  predictive  utility  of  tests  with 
content  constructed  so  as  to  reduce  subgroup  differences.  A  traditional  test 
would  presumably  be  better  than  such  a  nontraditional  test  for  predicting  traditional
academic  or  employment  outcome  criteria  in  a  racially  heterogeneous  population,
because  the  traditional  test  could  better  account  for  individual  differences
in  criterion  performance  between  members  of  the  different  subpopulations.
Jensen  (1974)  found  that  race  of  the  examiner  seldom  affected  mean  scores  of 
examinees.  In  the  one  exception,  both  Whites  and  Blacks  had  higher  scores  in 
the  presence  of  a  White  examiner. 

Alternatives  to  tests.  A  number  of  papers  discuss  non-paper-and-pencil
alternatives  to  tests,  primarily  subjective  evaluations  by  personnel  officers,
employment  supervisors,  or  teachers  (i.e.,  grades).  Arvey  (1979)  and  Hamner,  Kim,
Baird,  and  Bigoness  (1974)  discuss  the  problems  of  bias  in  employer  evaluations.
Cascio  (1976)  found  that  biographical  items  were  equally  valid  for  majority  and 
minority  applicants.  Goldman  and  Widawski  (1976)  point  out  that  minority  and 
majority  mean  differences  are  typically  smaller  on  high  school  grades  than  on 
standardized  tests.  They  suggest  that  in  some  settings,  selecting  among  college 
applicants  solely  on  the  basis  of  high  school  grades  would  increase  the  number 
of  minorities  selected  without  materially  affecting  the  validity  of  the  selec¬ 
tions. 

Fairness  to  Women 


Reed  (1976)  reviewed  the  sex  differences  literature  on  test  fairness  as 
defined  by  Cleary.  She  concluded  that  traditional  predictors  often  underpredict
the  college  achievement  of  females.  She  points  out  that  further
investigation  is  needed  into  the  reasons  why.  For  example,  the  differences  in 
regression  lines  may  result  because  (1)  males  and  females  typically  enter  dif¬ 
ferent  fields  of  study  or  (2)  females  who  enter  college  are  a  more  select  sample 
of  the  female  population.  If  these  two  explanations  are  correct,  differences  in 
male  and  female  regression  lines  would  be  expected  to  decrease  as  more  women 
enter  male-dominated  college  majors  and  a  greater  proportion  of  women  enter  col¬ 
lege. 


Reilly,  Zedeck,  and  Tenopyr  (1979)  studied  physical  measures  (e.g.,  arm
strength,  height)  as  predictors  of  performance  in  an  outdoor  craft.  No  differ¬ 
ences  in  regression  lines  were  found,  suggesting  that  selection  based  on  such 
criteria  is  fair,  as  defined  by  Cleary  (1968).  In  studies  of  differential  validity,
Gross,  Faggen,  and  McCarthy  (1974)  and  Schmitt,  Melton,  and  Bylenga
(1978)  found  that  traditional  measures  consistently  predicted  academic  criteria
slightly  better  for  females  than  for  males.  Schmitt  et  al.  (1978)  found  the
same  trend  for  employment  data,  but  the  number  of  employment  studies  reviewed 
was  small.  Moss  and  Brown  (1979)  found  that  varying  the  sex  referent  in  reading
passages  did  not  significantly  change  the  reading  comprehension  scores  of
males  and  females.  Since  the  information  needed  to  successfully  answer  the 
questions  seems  to  have  been  given  in  the  reading  passage,  the  result  is  not 
surprising. 

Mai-Dalton,  Feldman-Summers,  and  Mitchell  (1979),  Simas  and  McCarrey
(1979),  Arvey  (1979),  and  Hamner,  Kim,  Baird,  and  Bigoness  (1974)  have  examined  the
fairness  to  women  of  employee  interviews  and  job  performance  evaluations.  The
direction  and  degree  of  sex  bias  in  these  studies  appear  to  depend  on  a  complex  interaction
of  rater  sex,  ratee  sex,  and  job  characteristics.  Arvey  (1979)  notes
that  employment  interviews  may  be  discriminatory  if  women  are  asked  different
questions  than  men  are  (e.g.,  "What  will  you  and  your  husband  do  when  your  children  get  sick?").


SUMMARY  AND  CONCLUSIONS 

The  period  1975  through  1979  was  one  of  considerable  activity  in  test  theory
and  its  methods,  covering  a  diverse  range  of  topics.  Because  the  main  results
and  methods  of  classical  test  theory  were  developed  and  refined  over  the  last  70
years  or  so,  little  progress  was  made  in  classical  test  theory;  there  is
little  progress  left  to  be  made.  The  period  saw  active  work  in  developing  alternatives
to  classical  test  theory.  Item  response  theory,  particularly  its  applications
to  a  variety  of  testing  problems  inadequately  handled  by  classical  test
theory,  has  been  the  subject  of  considerable  research  activity.  Methods  and
procedures  for  both  the  Rasch  model  and  its  generalizations
to  more  complex  item  response  functions  have  likewise  been  studied
extensively.  Estimation  procedures  for  these  models  have  been  refined
and  investigated,  and  the  robustness  of  the  estimation  procedures  has  been  stud¬ 
ied  under  a  variety  of  circumstances.  The  result  is  the  beginning  of  a  better 
appreciation  of  the  promise  and  limitations  of  these  models  and  their  areas  of 
application.  Progress  has  been  made  in  the  development  of  equating  procedures 
using  IRT  models  and  in  their  applications  to  adaptive  testing.  An  important 
new  field  of  research  that  has  developed  as  a  result  of  the  use  of  IRT  models  is 
the  area  of  person  fit  (person  reliability,  or  appropriateness  measurement), 
which  has  considerable  promise  for  applications  of  psychological  measurement  in 
practical  situations.  Even  so,  considerable  research  remains  to  be  done
on  IRT  models.  The  period  has  seen  a  needed  integration  between  test  theory 
approaches  based  on  IRT  and  other  models  of  psychological  measurement.  More 
work  is  needed  in  this  area  to  specify  and  to  describe  the  relationships  of  IRT 
models  to  other  areas  of  psychological  measurement,  in  order  to  reintegrate  psy¬ 
chological  testing  into  the  mainstream  of  psychology  and  its  measurement  proce¬ 
dures. 

There  has  been  more  research,  and  less  speculation,  about  the  utility  of 
criterion-referenced  tests  during  this  period.  Some  technical  advances  have 
been  made,  but  the  problem  of  the  arbitrariness  of  the  cutting  scores  remains
a  serious  limitation  to  important  applications  of  these  methods.  Order
theory  has  developed  as  a  potentially  viable  approach  to  psychological  testing,  but
considerable  additional  research  is  needed  before  it  can  be  shown  to  have  definite
advantages  over  either  classical  test  theory  or  item  response  theory.
No  studies  are  yet  available  comparing  order  theory  and  item  response  theory
approaches  on  the  same  data  sets.

Issues  of  test  fairness  have  received  considerable  attention.  The  litera¬ 
ture  has  focused  on  problems  of  item  and  test  bias  and  on  test  fairness  in  the 
study  of  differential  validity.  Before  the  issue  of  test  fairness  can  be  adequately
resolved,  the  problem  of  the  fairness  of  criteria  must  be  addressed.
Meanwhile,  the  search  continues  for  selection  devices  other  than  tests
that  are  likely  to  be  less  unfair.  A  realistic  comparison  of  these  approaches,
however,  would  include  evaluation  of  these  alternatives  on  the  same  criteria
used  to  evaluate  the  tests  themselves.

Some  progress  was  made  in  the  area  of  validation  by  the  use  of  structural 
equation  models,  particularly  in  the  analysis  of  multitrait-multimethod  matri¬ 
ces.  The  area  of  content  validity  was  somewhat  more  adequately  defined,  but  the 
issues  still  reduce  to  an  unacceptable  degree  of  individual  judgment  for  the 
definition  of  content  validity.  Some  research  during  the  period  has  contributed
to  the  understanding  of  problems  in  predictive  validity.

Thus,  as  in  most  other  fields,  progress  comes  slowly.  Future  research
in  test  theory  will  make  more  progress  if  less  emphasis  is  placed  on  relatively 
trivial  research  in  classical  test  theory  and  on  the  derivation  of  new  formulas 
for  already  known  concepts,  and  more  emphasis  is  placed  on  the  evaluation  of 
alternative  models  that  promise  considerable  improvement  in  the  design,  con¬ 
struction,  and  implementation  of  psychological  measuring  instruments. 